Performance with ZIL

Added by Jeff Gibson over 2 years ago

Good evening,

I have a new build I am working on that is not behaving like I would expect, and I wanted to get everyone else's thoughts on the matter. My questions are: why does it appear that the ZIL does not stripe across multiple devices, and is it unwise to try to create a layered pool on top of two separate pools (stripe 2x pools into 1 main pool that will then be shared over iSCSI)? Read on below to see what I have attempted so far.

My setup is 2x Intel 311 20GB SLC drives for the ZIL (HPA set to only allow 2GB to be visible), 2x Intel 320 120GB MLC for the L2ARC, and 18x Toshiba 10k SAS drives for the pool. I'll skip my previous benchmarks and get to the fact that it appears that Nexenta or ZFS is not striping the ZIL for improved performance. Here is how I came to that conclusion:

I created a pool, Test1, with Nexenta defaults except compression off for these tests (dedup:off, compress:off, autoexpand:off, sync:standard), with one ZIL device and one SAS drive. I ran this command: dd bs=8k count=426000 if=/dev/zero of=/volumes/Test1/test.file oflag=dsync and received 29.4MB/s. I then added 3x more SAS drives and ran the same command again once every disk's busy % was back down to 1 or 0. The result was up to 29.9MB/s. I then added a second ZIL device to the pool (no mirror) and received a speed of 29.8MB/s... So this tells me the ZIL is now (or always was) my bottleneck (which is only 25-35% busy according to iostat). I then removed the ZIL device that was just added, for the next test.

My next step was to create another pool of 1x ZIL and 4x SAS drives. I opened another session to the host and kicked off the above test alongside this modified one: dd bs=8k count=426000 if=/dev/zero of=/volumes/Test2/test.file oflag=dsync. Doing the napkin math, with both starting and ending within about 3s of each other, I got 29.5 + 29.4 = 58.9MB/s. This number would be acceptable if I could get it in a single pool.

This brings me back to the first of my two questions: is there a way to ensure the SSD is added as a striped portion of the ZIL? Is there a better way to do this test (I had started off with Iometer in a VM connected through iSCSI in ESX 5.0, but that was far too many layers to find where the problem started)?
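
For reference, here is roughly how the log devices have been attached (a sketch; each device added this way becomes its own top-level log vdev, which I assumed ZFS would stripe):

    zpool add Test1 log c0t50015179595CCEE9d0    # first SSD as a log vdev
    zpool add Test1 log c0t50015179595CD1B9d0    # second SSD as a separate log vdev (no mirror)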

To elaborate on my second question: it is about tiering pools to achieve the desired write speed. What would be the optimal way to do this? Create a file-based vdev that takes up the whole pool, then add the 2x vdevs together in another pool? Does anyone have a reference on how this would be done (I assume via the command line)?
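
What I am picturing is something like this sketch (file sizes and pool names here are hypothetical):

    # back each pool with one big file-based vdev, then stripe the two files into a new pool
    mkfile 500g /volumes/Pool1/pool1.store
    mkfile 500g /volumes/Pool2/pool2.store
    zpool create Main /volumes/Pool1/pool1.store /volumes/Pool2/pool2.store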

Thanks,

Jeff


Replies

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

One question: the specs for the 311 show 100MB/sec sustained writes (approx), yet you are seeing 29MB/sec? What do you get writing to the 311 as a data device? E.g., remove it as a ZIL, create a pool on it, and do a big write?
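
Something along these lines (a sketch; I'm using one of your SSD device names as it appears elsewhere in the thread):

    zpool remove Test1 c0t50015179595CCEE9d0      # detach the SSD from log duty
    zpool create ziltest c0t50015179595CCEE9d0    # use it as a plain data pool
    dd bs=128k count=16384 if=/dev/zero of=/volumes/ziltest/big.file oflag=dsync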

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Here are the individual test results:

Single drive, 8k blocks: 2087976960 bytes (2.1 GB) copied, 95.1585 seconds, 21.9 MB/s
Single drive, 128k blocks: 2088894464 bytes (2.1 GB) copied, 23.7734 seconds, 87.9 MB/s

Two drives, 8k blocks: 4151443456 bytes (4.2 GB) copied, 177.12 seconds, 23.4 MB/s
Two drives, 128k blocks: 4175953920 bytes (4.2 GB) copied, 42.1972 seconds, 99.0 MB/s

Why am I not gaining any 8k performance with the second drive? Is ZFS not striping across the pool and instead making it a JBOD?

zpool status -v ZIL2
  pool: ZIL2
 state: ONLINE
 scan: none requested
config:

    NAME                     STATE     READ WRITE CKSUM
    ZIL2                     ONLINE       0     0     0
      c0t50015179595CCEE9d0  ONLINE       0     0     0
      c0t50015179595CD1B9d0  ONLINE       0     0     0

errors: No known data errors

-Jeff

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

Boy, that is weird. Can you run iostat (I don't remember the right options offhand) to see if there are any glaring delays?
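
Maybe something like this, from memory:

    iostat -xn 5    # extended stats with device names, every 5 seconds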

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

cfgadm -al | grep -i 50015179595CCEE9
c26::w50015179595ccee9,0   connected   configured   unknown
    Client Device: /dev/dsk/c0t50015179595CCEE9d0s0(sd23)
    unavailable   disk-path   n   /devices/pci@0,0/pci8086,340a@3/pci1000,3020@0/iport@80:scsi::w50015179595ccee9,0

cfgadm -al | grep -i 50015179595CD1B9
c25::w50015179595cd1b9,0   connected   configured   unknown
    Client Device: /dev/dsk/c0t50015179595CD1B9d0s0(sd20)
    unavailable   disk-path   n   /devices/pci@0,0/pci8086,340a@3/pci1000,3020@0/iport@40:scsi::w50015179595cd1b9,0

Edit for poor formatting:

 iostat -x
                 extended device statistics
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b

sd2       1.4    0.5    5.2    0.0  0.0  0.0    0.0   0   0
sd3       2.0   63.5   19.6 7568.8  0.0  0.5    7.5   0   6
sd4       2.0   63.9   19.1 7643.7  0.0  0.5    7.9   0   7
sd5       1.4    0.5    5.2    0.0  0.0  0.0    0.0   0   0
sd6       2.0   63.5   23.2 7568.5  0.0  0.5    7.3   0   6
sd7       1.9   63.9   12.7 7644.0  0.0  0.5    7.8   0   6
sd8       1.8   63.1   15.7 7566.5  0.0  0.6    8.9   0   7
sd9       2.0   63.5   22.0 7569.1  0.0  0.5    7.5   0   6
sd10      1.7    1.1    6.3   45.5  0.0  0.0    1.1   0   0
sd11      1.6    0.6    5.7    0.0  0.0  0.0    0.0   0   0
sd12      2.0   63.5   19.6 7569.4  0.0  0.5    7.6   0   6
sd13      1.8   63.2   13.5 7568.8  0.0  0.5    7.8   0   6
sd14      0.7   19.4    3.0   95.1  0.0  0.0    0.4   0   0
sd15      2.0   63.3   22.8 7554.5  0.0  0.5    7.4   0   6
sd16      1.8   63.2   14.6 7567.8  0.0  0.5    7.8   0   6
sd17      0.8   19.5    3.1   95.1  0.0  0.0    0.4   0   0
sd18      1.9   63.3   21.6 7553.0  0.0  0.5    7.4   0   6
sd19      1.8   63.2   15.1 7569.1  0.0  0.5    7.8   0   6
sd20      1.9   81.3    6.8  831.4  0.0  0.0    0.3   0   1
sd21      1.6    0.6    5.8    0.0  0.0  0.0    0.0   0   0
sd22      1.8   63.2   13.3 7569.2  0.0  0.5    7.6   0   6
sd23      1.9  137.9    6.7 1138.5  0.0  0.0    0.2   0   1
sd24      1.9   64.5   12.8 7703.8  0.0  0.5    7.7   0   6
sd25      1.8   63.2   14.2 7569.5  0.0  0.5    8.5   0   7

Was that the output you were looking for?

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

Can you run 'iostat -xn' instead?

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Here is what you asked for:

 iostat -xn | grep -i 'c0t50015179595CCEE9d0\|c0t50015179595CD1B9d0'
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.9   89.2    6.7  906.1  0.0  0.0    0.0    0.3   0   1 c0t50015179595CD1B9d0
    1.8  144.7    6.6 1206.8  0.0  0.0    0.0    0.2   0   1 c0t50015179595CCEE9d0

I also reran dd bs=8k count=509760 if=/dev/zero of=/volumes/ZIL2/test.file oflag=dsync and captured iostat statistics every 5s:

dd bs=8k count=509760 if=/dev/zero of=/volumes/ZIL2/test.file oflag=dsync
dd: writing `/volumes/ZIL2/test.file': No space left on device
506753+0 records in
506752+0 records out
4151312384 bytes (4.2 GB) copied, 175.785 seconds, 23.6 MB/s

iostat -zxnM 5 | grep -i 'c0t50015179595CCEE9d0\|c0t50015179595CD1B9d0'
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    1.9   79.9    0.0    0.8  0.0  0.0    0.0    0.3   0   1 c0t50015179595CD1B9d0
    1.8  135.6    0.0    1.1  0.0  0.0    0.0    0.2   0   1 c0t50015179595CCEE9d0

    0.0 3094.4    0.0   30.8  0.0  1.4    0.0    0.4   1  26 c0t50015179595CD1B9d0
    0.0 3093.4    0.0   30.9  0.0  1.4    0.0    0.5   1  26 c0t50015179595CCEE9d0

    0.2 3102.1    0.0   29.0  0.0  1.2    0.0    0.4   1  25 c0t50015179595CD1B9d0
    0.2 3097.1    0.0   29.1  0.0  1.2    0.0    0.4   1  24 c0t50015179595CCEE9d0

    0.0 3114.0    0.0   29.3  0.0  1.3    0.0    0.4   1  25 c0t50015179595CD1B9d0
    0.0 3118.8    0.0   29.6  0.0  1.2    0.0    0.4   1  25 c0t50015179595CCEE9d0

    2.2 3245.8    0.0   30.8  0.0  1.3    0.0    0.4   1  25 c0t50015179595CD1B9d0
    2.2 3248.0    0.0   30.8  0.0  1.3    0.0    0.4   1  25 c0t50015179595CCEE9d0

    0.0 2824.5    0.0   28.7  0.0  1.6    0.0    0.6   1  32 c0t50015179595CD1B9d0
    0.0 2809.1    0.0   26.5  0.0  1.1    0.0    0.4   1  21 c0t50015179595CCEE9d0

    0.2 2836.3    0.0   25.0  0.0  1.0    0.0    0.4   1  28 c0t50015179595CD1B9d0
    0.2 2856.9    0.0   26.9  0.0  1.1    0.0    0.4   1  22 c0t50015179595CCEE9d0

    0.0 2846.3    0.0   28.1  0.0  1.2    0.0    0.4   1  24 c0t50015179595CD1B9d0
    0.0 2831.1    0.0   26.3  0.0  2.1    0.0    0.7   1  32 c0t50015179595CCEE9d0

    0.2 3312.6    0.0   19.4  0.0  0.1    0.0    0.0   1  14 c0t50015179595CD1B9d0
    0.2 3328.8    0.0   21.2  0.0  0.4    0.0    0.1   1  18 c0t50015179595CCEE9d0

    2.0 3337.4    0.0   32.1  0.0  1.4    0.0    0.4   1  27 c0t50015179595CD1B9d0
    2.0 3341.8    0.0   32.1  0.0  1.3    0.0    0.4   1  26 c0t50015179595CCEE9d0

    0.2 3151.6    0.0   30.4  0.0  1.3    0.0    0.4   1  24 c0t50015179595CD1B9d0
    0.2 3163.4    0.0   30.5  0.0  1.3    0.0    0.4   1  25 c0t50015179595CCEE9d0

    0.0 3159.8    0.0   29.9  0.0  1.3    0.0    0.4   1  24 c0t50015179595CD1B9d0
    0.0 3151.0    0.0   30.0  0.0  1.3    0.0    0.4   1  24 c0t50015179595CCEE9d0

    0.2 2998.8    0.0   33.2  0.0  1.7    0.0    0.6   1  29 c0t50015179595CD1B9d0
    0.2 3001.0    0.0   32.9  0.0  1.7    0.0    0.6   1  28 c0t50015179595CCEE9d0

    2.0 3001.8    0.0   33.7  0.0  1.7    0.0    0.6   1  31 c0t50015179595CD1B9d0
    2.0 3007.8    0.0   33.9  0.0  1.7    0.0    0.6   1  28 c0t50015179595CCEE9d0

    0.2 3141.8    0.0   28.3  0.0  1.1    0.0    0.4   1  24 c0t50015179595CD1B9d0
    0.2 3124.0    0.0   27.9  0.0  1.1    0.0    0.3   1  22 c0t50015179595CCEE9d0

    0.0 3148.2    0.0   26.7  0.0  0.9    0.0    0.3   1  21 c0t50015179595CD1B9d0
    0.0 3174.2    0.0   29.0  0.0  1.1    0.0    0.4   1  23 c0t50015179595CCEE9d0

    0.2 3167.4    0.0   28.4  0.0  1.1    0.0    0.4   1  24 c0t50015179595CD1B9d0
    0.2 3140.2    0.0   26.4  0.0  0.9    0.0    0.3   1  21 c0t50015179595CCEE9d0

    2.0 2986.8    0.0   34.2  0.0  1.8    0.0    0.6   1  29 c0t50015179595CD1B9d0
    2.0 2995.8    0.0   34.2  0.0  1.8    0.0    0.6   1  29 c0t50015179595CCEE9d0

    0.2 3239.8    0.0   26.7  0.0  0.9    0.0    0.3   1  21 c0t50015179595CD1B9d0
    0.2 3244.2    0.0   26.9  0.0  0.9    0.0    0.3   1  21 c0t50015179595CCEE9d0

    0.0 2497.8    0.0   22.1  0.0  1.1    0.0    0.4   1  29 c0t50015179595CD1B9d0
    0.0 2495.6    0.0   22.1  0.0  0.8    0.0    0.3   1  18 c0t50015179595CCEE9d0

    0.2 3069.0    0.0   32.2  0.0  1.5    0.0    0.5   1  27 c0t50015179595CD1B9d0
    0.2 3057.2    0.0   32.0  0.0  1.5    0.0    0.5   1  27 c0t50015179595CCEE9d0

    0.0 2862.8    0.0   24.7  0.0  0.9    0.0    0.3   1  20 c0t50015179595CD1B9d0
    0.0 2857.2    0.0   24.7  0.0  1.5    0.0    0.5   1  28 c0t50015179595CCEE9d0

    0.2 3106.4    0.0   29.4  0.0  1.3    0.0    0.4   1  26 c0t50015179595CD1B9d0
    0.2 3100.0    0.0   29.4  0.0  1.2    0.0    0.4   1  24 c0t50015179595CCEE9d0

    0.0 3173.2    0.0   30.1  0.0  1.3    0.0    0.4   1  25 c0t50015179595CD1B9d0
    0.0 3176.0    0.0   30.1  0.0  1.2    0.0    0.4   1  25 c0t50015179595CCEE9d0

    0.2 3048.5    0.0   33.2  0.0  1.6    0.0    0.5   1  28 c0t50015179595CD1B9d0
    0.2 3053.3    0.0   33.4  0.0  1.7    0.0    0.5   1  28 c0t50015179595CCEE9d0

    0.0 3187.1    0.0   28.0  0.0  1.1    0.0    0.3   1  24 c0t50015179595CD1B9d0
    0.0 3189.9    0.0   27.8  0.0  1.0    0.0    0.3   1  22 c0t50015179595CCEE9d0

    2.2 3196.6    0.0   30.7  0.0  1.3    0.0    0.4   1  25 c0t50015179595CD1B9d0
    2.2 3199.8    0.0   31.7  0.0  1.4    0.0    0.4   1  26 c0t50015179595CCEE9d0

    0.0 3159.2    0.0   29.6  0.0  1.2    0.0    0.4   1  24 c0t50015179595CD1B9d0
    0.0 3175.0    0.0   29.7  0.0  1.2    0.0    0.4   1  24 c0t50015179595CCEE9d0

    0.2 3184.9    0.0   30.0  0.0  1.3    0.0    0.4   1  25 c0t50015179595CD1B9d0
    0.2 3183.5    0.0   29.1  0.0  1.1    0.0    0.4   1  24 c0t50015179595CCEE9d0

    0.0 3053.9    0.0   30.5  0.0  1.4    0.0    0.5   1  28 c0t50015179595CD1B9d0
    0.0 3039.9    0.0   30.4  0.0  1.4    0.0    0.4   1  25 c0t50015179595CCEE9d0

    2.2 3245.5    0.0   30.1  0.0  1.2    0.0    0.4   1  26 c0t50015179595CD1B9d0
    2.2 3254.1    0.0   30.5  0.0  1.2    0.0    0.4   1  25 c0t50015179595CCEE9d0

    0.0 3257.4    0.0   29.7  0.0  1.1    0.0    0.3   1  24 c0t50015179595CD1B9d0
    0.0 3231.2    0.0   29.5  0.0  1.1    0.0    0.3   1  24 c0t50015179595CCEE9d0

    0.2 2777.7    0.0   26.2  0.0  1.7    0.0    0.6   1  33 c0t50015179595CD1B9d0
    0.2 2760.7    0.0   26.1  0.0  1.0    0.0    0.4   1  22 c0t50015179595CCEE9d0

    0.0 3374.8    0.0   31.0  0.0  1.1    0.0    0.3   1  26 c0t50015179595CD1B9d0
    0.0 3348.4    0.0   31.4  0.0  1.1    0.0    0.3   1  26 c0t50015179595CCEE9d0

    2.2 3134.2    0.0   29.3  0.0  0.9    0.0    0.3   1  24 c0t50015179595CD1B9d0
    2.2 3099.6    0.0   28.6  0.0  1.5    0.0    0.5   1  33 c0t50015179595CCEE9d0

    0.0 5123.3    0.0   40.1  0.0  0.9    0.0    0.2   2  32 c0t50015179595CD1B9d0
    0.0 5045.1    0.0   39.4  0.0  0.9    0.0    0.2   2  32 c0t50015179595CCEE9d0

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

Now I am confused. How on earth is a 4GB data file getting an 'out of space' error?

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Oh, ignore that. That was bad math on my part. These 20GB SLC drives have been short-stroked to 2GB each using the HPA to increase longevity (per Intel docs).

-Jeff

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

Once a pool fills up, performance goes to crap, so the numbers you are measuring are not useful :( I don't understand short-stroking an SSD, though?

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Per the Intel whitepaper, it is for the longevity of the drive (not for performance). Until SCSI UNMAP or TRIM support is added, and since partitions/slices are discouraged, using the HPA allows the drive to implicitly keep 90% of its space free for its own wear leveling.
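
For the curious, a minimal sketch of clipping the drive with hdparm on a Linux box (the exact tool and sector count here are my assumptions, not from the Intel doc):

    # expose only 2 GiB (4194304 * 512-byte sectors); the rest stays free for wear leveling
    hdparm -N p4194304 /dev/sdX    # 'p' makes the clipped capacity persist across power cycles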

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

I deleted and recreated the pool, and reran the tests with a 1GB file:

dd bs=8k count=131072 if=/dev/zero of=/volumes/ZIL2/test.file oflag=dsync
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB) copied, 48.4026 seconds, 22.2 MB/s

rm /volumes/ZIL2/test.file

Grew the pool with the second disk:

dd bs=8k count=131072 if=/dev/zero of=/volumes/ZIL2/test.file oflag=dsync
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB) copied, 43.961 seconds, 24.4 MB/s

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Redoing the above tests on the 2-disk pool:

dd bs=8k count=131072 if=/dev/zero of=/volumes/ZIL2/test.file oflag=dsync
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB) copied, 42.802 seconds, 25.1 MB/s

Resulted in

iostat -zxnM 5 | grep -i 'c0t50015179595CCEE9d0\|c0t50015179595CD1B9d0'
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    1.9   92.4    0.0    0.9  0.0  0.0    0.0    0.3   0   1 c0t50015179595CD1B9d0
    1.8  141.1    0.0    1.2  0.0  0.0    0.0    0.2   0   1 c0t50015179595CCEE9d0

    4.0 3370.7    0.0   28.5  0.0  1.0    0.0    0.3   1  22 c0t50015179595CD1B9d0
    4.0 3367.5    0.0   28.5  0.0  1.0    0.0    0.3   1  22 c0t50015179595CCEE9d0

    2.2 3185.5    0.0   30.7  0.0  1.3    0.0    0.4   1  25 c0t50015179595CD1B9d0
    2.2 3189.3    0.0   30.9  0.0  1.3    0.0    0.4   1  25 c0t50015179595CCEE9d0

    2.0 2681.6    0.0   26.9  0.0  1.2    0.0    0.5   1  22 c0t50015179595CD1B9d0
    2.0 2681.2    0.0   27.1  0.0  2.5    0.0    0.9   1  36 c0t50015179595CCEE9d0

    0.2 3353.2    0.0   30.3  0.0  1.2    0.0    0.3   1  24 c0t50015179595CD1B9d0
    0.2 3340.8    0.0   28.2  0.0  1.0    0.0    0.3   1  23 c0t50015179595CCEE9d0

    2.0 3112.8    0.0   29.2  0.0  1.2    0.0    0.4   1  23 c0t50015179595CD1B9d0
    2.0 3135.8    0.0   31.8  0.0  1.4    0.0    0.5   1  27 c0t50015179595CCEE9d0

    4.2 3223.2    0.0   31.6  0.0  1.3    0.0    0.4   1  25 c0t50015179595CD1B9d0
    4.2 3204.4    0.0   29.0  0.0  1.1    0.0    0.4   1  24 c0t50015179595CCEE9d0

    2.0 3145.4    0.0   28.8  0.0  1.1    0.0    0.3   1  23 c0t50015179595CD1B9d0
    2.0 3164.4    0.0   31.1  0.0  1.4    0.0    0.4   1  26 c0t50015179595CCEE9d0

    2.2 3153.0    0.0   32.6  0.0  1.5    0.0    0.5   1  26 c0t50015179595CD1B9d0
    2.2 3140.8    0.0   29.7  0.0  1.2    0.0    0.4   1  23 c0t50015179595CCEE9d0

and

dd bs=128k count=8192 if=/dev/zero of=/volumes/ZIL2/test.file oflag=dsync
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB) copied, 9.91495 seconds, 108 MB/s
 iostat -zxnM 5 | grep -i 'c0t50015179595CCEE9d0\|c0t50015179595CD1B9d0'
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    1.9   94.3    0.0    0.9  0.0  0.0    0.0    0.3   0   1 c0t50015179595CD1B9d0
    1.8  142.8    0.0    1.2  0.0  0.0    0.0    0.2   0   1 c0t50015179595CCEE9d0

    4.0 1407.1    0.0   49.9  0.0  0.4    0.0    0.3   1  38 c0t50015179595CD1B9d0
    4.0 1414.0    0.0   49.9  0.0  0.3    0.0    0.2   1  31 c0t50015179595CCEE9d0

    2.2 1560.1    0.0   54.2  0.0  0.4    0.0    0.2   1  35 c0t50015179595CD1B9d0
    2.2 1574.5    0.0   54.4  0.0  0.4    0.0    0.2   1  33 c0t50015179595CCEE9d0

RE: Performance with ZIL - Added by Linda Kateley over 2 years ago

Jeff,

Can we see zpool status -v?

lk

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

zpool status -v ZIL2
  pool: ZIL2
 state: ONLINE
 scan: none requested
config:

    NAME                     STATE     READ WRITE CKSUM
    ZIL2                     ONLINE       0     0     0
      c0t50015179595CD1B9d0  ONLINE       0     0     0
      c0t50015179595CCEE9d0  ONLINE       0     0     0

errors: No known data errors

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

My guess is the dsync flag is killing the small writes. If you are using a ZIL, rather than using dsync (which I am guessing forces waits until the writes have posted to the data pool), you should be setting sync=always on the data pool and letting the ZIL do its work.
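
In other words, something like this sketch (using the pool name from later in the thread):

    zfs set sync=always Store    # force every write through the ZIL
    dd bs=8k count=524288 if=/dev/zero of=/volumes/Store/test.file    # note: no oflag=dsync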

RE: Performance with ZIL - Added by Linda Kateley over 2 years ago

I am confused that this looks like the ZIL is a separate pool. The ZIL is typically something that is added to a pool to speed up sync write traffic. A ZIL device should be added as a log to a pool.

Can I see all of zpool status? It should show all your SAS devices and the ZIL in one pool.
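
For example, a sketch using the SSD device names from earlier in the thread (a mirrored log is the usual recommendation):

    zpool add Store log mirror c0t50015179595CCEE9d0 c0t50015179595CD1B9d0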

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

Linda, you missed some context. A few posts back, I asked him to add the ZIL(s) as a data pool so we could test throughput to them and make sure there were no issues. But I am now thinking the dsync flag is what is killing him.

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Linda Kateley wrote:

I am confused that this looks like the ZIL is a separate pool. The ZIL is typically something that is added to a pool to speed up sync write traffic. A ZIL device should be added as a log to a pool.

Can I see all of zpool status? It should show all your SAS devices and the ZIL in one pool.

This was a pool of just the SSDs that we were benchmarking. Since I'm the inexperienced person around here, will you tell me what you would like to see to help diagnose why performance does not scale with the number of devices when I add multiple ZIL devices (or, as in the recent test cases, multiple data devices)?

Would you like to see what happens when I have a pool of 16 mirrors, with 1 ZIL device and then with 2?

RE: Performance with ZIL - Added by Andrew Galloway over 2 years ago

OK, hold on. There are some serious misconceptions going on in here about what a ZIL is.

ZIL == ZFS Intent Log

The ZFS Intent Log is a LOG that ZFS utilizes to guarantee data integrity while simultaneously maintaining its usual 'lazy write' style of writing data. To understand what's going on, here's a simple flow:

1. write comes in
2. write goes into RAM
3. if the write is synchronous, data is pushed to the ZIL and the ZIL is cache-flushed
4. ZFS informs the client the write is done

That's the flow, essentially. There are some other caveats, but that's the gist. Note that if the write is asynchronous, the ZIL is never used.

If the zpool contains no 'log' devices, the 'ZIL' is the data disks themselves. This means that in addition to having to handle normal read I/O and the every-5-second sequential txg commit write I/O, the data disks have to handle a ton of random write I/O WITH cache flushes at the same time -- a terrible workload for spinning disks, and a burden to put on an SSD in addition to the other load. This is why we recommend dedicated SSD or RAM-backed log devices. It is not to 'improve speed' per se; it is to 'offload ZIL workload onto SSD', which has a side effect of likely improving speed.

It is also worth noting that the above flow is very simplistic -- ZFS will bypass the log device and write straight to the pool with synchronous traffic if any of a couple of conditions are met.

In addition to all of this, there is fairly sophisticated throttling that will start to hurt you if ZFS determines the incoming workload is too much for the drives. Also, if there is a log device and it is small, ZFS can end up having to commit txgs faster than intended because the log fills (and when that happens, I'm pretty sure it will also start throttling in an attempt to get back to the default of only committing a txg once every 5 seconds). It seems possible that this is going on here, since your last update shows 2 dd's, one running a lot longer than the other, and in the long one the speed is 29 MB/s while in the short one it is 100 MB/s. I have a feeling that because you have the ZIL enabled but effectively only a 2-disk pool, it is double-dipping quite heavily and doing a ton of cache flushing. Try running your test again, but first set sync=disabled on the ZIL2 pool and see what that ends up with.
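
Something like this sketch (and remember to put the setting back afterwards):

    zfs set sync=disabled ZIL2    # bypass the ZIL entirely for this pool
    dd bs=8k count=131072 if=/dev/zero of=/volumes/ZIL2/test.file oflag=dsync
    zfs inherit sync ZIL2         # restore the default (sync=standard)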

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

Andrew, as I said, the only reason the ZIL was being used as a pool was to try to eliminate any performance issues relating to the SSDs themselves. That said, I am suspicious that the performance sucks doing a lot of small writes but is fine with the bigger blocks, even though it's the same total amount of data. I am also not sure sync=disabled will help if he is doing 'oflag=dsync'?

RE: Performance with ZIL - Added by Linda Kateley over 2 years ago

If you want to test ZIL functionality, you need to have sync writes. If the app is sending an async write, it will go directly to disk.

If I disable sync, I am bypassing the ZIL, so I can see the differences between how async traffic is handled vs. sync.

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

OK, let's start over... Here is the pool as I'd like to see it, with one log drive attached, and the dd results for a 4GB file with an 8k block size (this will be hosting VMs from ESXi/VMFS5, hence the dsync to simulate their writes).

zpool status -v Store
  pool: Store
 state: ONLINE
 scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        Store                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            c10t50000393B8037356d0  ONLINE       0     0     0
            c11t50000393B802E0DAd0  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            c12t50000393B8035F6Ad0  ONLINE       0     0     0
            c13t50000393B801B65Ad0  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            c14t50000393B802DA32d0  ONLINE       0     0     0
            c15t50000393B802DA52d0  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            c16t50000393B802DA96d0  ONLINE       0     0     0
            c17t50000393B802D9F6d0  ONLINE       0     0     0
          mirror-4                  ONLINE       0     0     0
            c1t50000393B803ACD6d0   ONLINE       0     0     0
            c2t50000393B8037386d0   ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            c3t50000393B8037336d0   ONLINE       0     0     0
            c4t50000393B802E092d0   ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            c6t50000393B802DA92d0   ONLINE       0     0     0
            c7t50000393B802E0A6d0   ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            c8t50000393B802DA06d0   ONLINE       0     0     0
            c9t50000393B802D9E6d0   ONLINE       0     0     0
        logs
          c0t50015179595CCEE9d0     ONLINE       0     0     0
        cache
          c0t5001517A6BE6B48Bd0     ONLINE       0     0     0
          c0t5001517A6BE7289Ed0     ONLINE       0     0     0
        spares
          c18t50000393B8035ABAd0    AVAIL
          c5t50000393B802DA2Ed0     AVAIL

errors: No known data errors
dd bs=8k count=524288 if=/dev/zero of=/volumes/Store/test.file oflag=dsync
524288+0 records in
524288+0 records out
4294967296 bytes (4.3 GB) copied, 154.41 seconds, 27.8 MB/s
root@BearCreekSan:/export/home/admin# rm /volumes/Store/test.file
root@BearCreekSan:/export/home/admin# dd bs=128k count=32768 if=/dev/zero of=/volumes/Store/test.file oflag=dsync
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB) copied, 55.2498 seconds, 77.7 MB/s

I then add the second ZIL device (and would expect an increase in performance, since the ZIL is bottlenecked at around 30MB/s on 8k writes):

zpool status -v Store
  pool: Store
 state: ONLINE
 scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        Store                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            c10t50000393B8037356d0  ONLINE       0     0     0
            c11t50000393B802E0DAd0  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            c12t50000393B8035F6Ad0  ONLINE       0     0     0
            c13t50000393B801B65Ad0  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            c14t50000393B802DA32d0  ONLINE       0     0     0
            c15t50000393B802DA52d0  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            c16t50000393B802DA96d0  ONLINE       0     0     0
            c17t50000393B802D9F6d0  ONLINE       0     0     0
          mirror-4                  ONLINE       0     0     0
            c1t50000393B803ACD6d0   ONLINE       0     0     0
            c2t50000393B8037386d0   ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            c3t50000393B8037336d0   ONLINE       0     0     0
            c4t50000393B802E092d0   ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            c6t50000393B802DA92d0   ONLINE       0     0     0
            c7t50000393B802E0A6d0   ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            c8t50000393B802DA06d0   ONLINE       0     0     0
            c9t50000393B802D9E6d0   ONLINE       0     0     0
        logs
          c0t50015179595CCEE9d0     ONLINE       0     0     0
          c0t50015179595CD1B9d0     ONLINE       0     0     0
        cache
          c0t5001517A6BE6B48Bd0     ONLINE       0     0     0
          c0t5001517A6BE7289Ed0     ONLINE       0     0     0
        spares
          c18t50000393B8035ABAd0    AVAIL
          c5t50000393B802DA2Ed0     AVAIL

errors: No known data errors

And reran the tests:

dd bs=8k count=524288 if=/dev/zero of=/volumes/Store/test.file oflag=dsync
524288+0 records in
524288+0 records out
4294967296 bytes (4.3 GB) copied, 152.843 seconds, 28.1 MB/s
root@BearCreekSan:/export/home/admin# rm /volumes/Store/test.file
root@BearCreekSan:/export/home/admin# dd bs=128k count=32768 if=/dev/zero of=/volumes/Store/test.file oflag=dsync
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB) copied, 28.499 seconds, 151 MB/s

Now, I like that the 128k performance increased, but I would like to see the 8k performance improve by more than 0.3MB/s.

On the throttling comment, correct me if I'm wrong, but wouldn't throttling kick in at a lower speed if I only had one disk, vs. the four or eight that I'm trying to implement?

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

Not sure who you were responding to, Linda. I was testing some of this same stuff. I think if he puts sync=always on the data pool, he will be testing the ZIL regardless of oflag=dsync.

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

For completeness, I set sync=always, deleted the test file, and reran:

dd bs=8k count=524288 if=/dev/zero of=/volumes/Store/test.file
524288+0 records in
524288+0 records out
4294967296 bytes (4.3 GB) copied, 155.42 seconds, 27.6 MB/s

-Jeff

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

I am seeing the same performance anomaly with small writes vs. big ones. My setup: 3x2 mirrored 640GB WD Blue drives. I created a test dataset on the pool and set sync=always. I have a 300GB 15k Seagate SAS drive I was using for cache; I removed it and added it as a ZIL. When doing 8K writes, I get absolutely abysmal performance, whereas a 128K block write size got me an 8X improvement.

RE: Performance with ZIL - Added by Linda Kateley over 2 years ago

Got it, thanks.

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Dan Swartzendruber wrote:

I am seeing the same performance anomaly with small writes vs. big ones. My setup: 3x2 mirrored 640GB WD Blue drives. I created a test dataset on the pool and set sync=always. I have a 300GB 15k Seagate SAS drive I was using for cache; I removed it and added it as a ZIL. When doing 8K writes, I get absolutely abysmal performance, whereas a 128K block write size got me an 8X improvement.

I don't miss the point that small writes to spinning media are not going to be fast; my point is that a single spinning drive with my SSD ZIL lets me write at 30MB/s. To me this means either the SSD is the bottleneck or the flushing to disk is. The next step I took was to increase the number of disks in the pool to 4 (no mirrors, just more disks), and I was unable to get any more than 30MB/s. I then added all of the disks to the pool, again no mirrors (making it 18 RAID0 10k SAS drives), and was still locked at 30MB/s. This makes me believe that throttling due to disk flush delay is not the bottleneck. So I then added the extra SSD as another log (again not mirrored), and I still have a 30MB/s bottleneck. It's this last fact that is baffling to me. I then created two new pools, split as 1 log and 9 disks each. I ran the same test against both pools at the same time and was able to get roughly 60MB/s. This leads me to believe the system itself can handle more throughput, as can the underlying disks. It just seems like ZFS doesn't want to use that bandwidth...

Can anyone suggest a different way to setup or test to find what the bottleneck of the system is?

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

I forgot to add this earlier and have since destroyed the pool for further testing, but here is what I tried: I created 2 pools, one of 3x3 RAIDZ and one of 2x3 RAIDZ, with each pool having one SSD ZIL. I tested each pool and was able to get about 30MB/s from each, at the same time. I then used mkfile to create a container file that took up about 85% of each pool's storage, and used zpool create test /Pool1/pool1.store /Pool2/pool2.store to combine those two files into a new singular pool. Running the same dd tests as above, I was only able to get about 25-30MB/s to the combined pool. It looks like combining them added even more overhead rather than increasing performance.

I don't know if this adds any extra information to the problem or not, but right now I'm just supplying as many data points as possible.

RE: Performance with ZIL - Added by Dan Swartzendruber over 2 years ago

Jeff, you are missing my point. I wasn't implying this has anything whatsoever to do with small writes to spinning media. Clearly, though, there is some kind of tuning issue (or something) that is crippling small writes, even with the ZIL.

RE: Performance with ZIL - Added by Linda Kateley over 2 years ago

If you are open to trying, I would like to see what would happen if you put all the disks in one big old pool... just striped, and add the ZIL, but leave off the cache...

RE: Performance with ZIL - Added by Linda Kateley over 2 years ago

Also, it would be nice to see if you could get some data locally. I'm starting to think the network is the bottleneck.

You can log in to the box as admin or root and run the same commands.

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Linda Kateley wrote:

If you are open to trying, I would like to see what would happen if you put all the disks in one big old pool... just striped, and add the ZIL, but leave off the cache...

zpool status -v Test
  pool: Test
 state: ONLINE
 scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        Test                      ONLINE       0     0     0
          c10t50000393B8037356d0  ONLINE       0     0     0
          c11t50000393B802E0DAd0  ONLINE       0     0     0
          c12t50000393B8035F6Ad0  ONLINE       0     0     0
          c13t50000393B801B65Ad0  ONLINE       0     0     0
          c14t50000393B802DA32d0  ONLINE       0     0     0
          c15t50000393B802DA52d0  ONLINE       0     0     0
          c16t50000393B802DA96d0  ONLINE       0     0     0
          c17t50000393B802D9F6d0  ONLINE       0     0     0
          c18t50000393B8035ABAd0  ONLINE       0     0     0
          c1t50000393B803ACD6d0   ONLINE       0     0     0
          c2t50000393B8037386d0   ONLINE       0     0     0
          c3t50000393B8037336d0   ONLINE       0     0     0
          c4t50000393B802E092d0   ONLINE       0     0     0
          c5t50000393B802DA2Ed0   ONLINE       0     0     0
          c6t50000393B802DA92d0   ONLINE       0     0     0
          c7t50000393B802E0A6d0   ONLINE       0     0     0
          c8t50000393B802DA06d0   ONLINE       0     0     0
          c9t50000393B802D9E6d0   ONLINE       0     0     0
        logs
          c0t50015179595CCEE9d0   ONLINE       0     0     0
          c0t50015179595CD1B9d0   ONLINE       0     0     0

errors: No known data errors

Like so?

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Linda Kateley wrote:

Also, it would be nice to see if you could get some data locally. I'm starting to think the network is the bottleneck.

You can log in to the box as admin or root and run the same commands.

You believe that my SSHing to the box is impacting my dd benchmarks? I'll try the commands locally, but I don't think that is an issue.

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

I ran the same dd benchmarks at the local console and got the following:

8k: 4.3GB in 143s, for 29.9MB/s
128k: 4.3GB in 27s, for 157MB/s

-Jeff

RE: Performance with ZIL - Added by Andrew Galloway over 2 years ago

OK, so first, I agree, there's an issue here. The catch is that we need to identify it, and nothing going on here is actually removing those SSDs from the equation, and they're what I think is the problem. To eliminate them, we need to remove them from the equation momentarily.

What needs to happen is you need to create the pool like you want it (you can leave the SSDs out or not; sync=disabled is going to effectively not use them anyway), set sync=disabled, and run a dd (with no dsync option, not that it will matter with sync=disabled). I want to see what THAT performs like.

Is it possible for me to get access to this machine, temporarily? You can email me at andrew dot galloway at nexenta dot com.

  • Andrew

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Andrew Galloway wrote:

OK, so first, I agree, there's an issue here. The catch is that we need to identify it, and nothing going on here is actually removing those SSDs from the equation, and they're what I think is the problem. To eliminate them, we need to remove them from the equation momentarily.

What needs to happen is you need to create the pool like you want it (you can leave the SSDs out or not; sync=disabled is going to effectively not use them anyway), set sync=disabled, and run a dd (with no dsync option, not that it will matter with sync=disabled). I want to see what THAT performs like.

Is it possible for me to get access to this machine, temporarily? You can email me at andrew dot galloway at nexenta dot com.

  • Andrew
zpool status -v Store
  pool: Store
 state: ONLINE
 scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        Store                       ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            c10t50000393B8037356d0  ONLINE       0     0     0
            c11t50000393B802E0DAd0  ONLINE       0     0     0
            c12t50000393B8035F6Ad0  ONLINE       0     0     0
          raidz1-1                  ONLINE       0     0     0
            c13t50000393B801B65Ad0  ONLINE       0     0     0
            c14t50000393B802DA32d0  ONLINE       0     0     0
            c15t50000393B802DA52d0  ONLINE       0     0     0
          raidz1-2                  ONLINE       0     0     0
            c16t50000393B802DA96d0  ONLINE       0     0     0
            c17t50000393B802D9F6d0  ONLINE       0     0     0
            c18t50000393B8035ABAd0  ONLINE       0     0     0
          raidz1-3                  ONLINE       0     0     0
            c1t50000393B803ACD6d0   ONLINE       0     0     0
            c2t50000393B8037386d0   ONLINE       0     0     0
            c3t50000393B8037336d0   ONLINE       0     0     0
          raidz1-4                  ONLINE       0     0     0
            c4t50000393B802E092d0   ONLINE       0     0     0
            c5t50000393B802DA2Ed0   ONLINE       0     0     0
            c6t50000393B802DA92d0   ONLINE       0     0     0

errors: No known data errors

I changed dd to use 2x system memory, or ~144GB, for the second test after the first one finished so fast:

 dd bs=8k count=524288 if=/dev/zero of=/volumes/Store/test.file
524288+0 records in
524288+0 records out
4294967296 bytes (4.3 GB) copied, 7.66428 seconds, 560 MB/s
root@BearCreekSAN:/export/home/admin# dd bs=8k count=18874368 if=/dev/zero of=/volumes/Store/test.file
18874368+0 records in
18874368+0 records out
154618822656 bytes (155 GB) copied, 305.462 seconds, 506 MB/s

Caching works well on the system...

RE: Performance with ZIL - Added by Linda Kateley over 2 years ago

Yes, so now we know it's the network.

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

Linda Kateley wrote:

Yes, so now we know it's the network.

I am really confused by this. No network protocols were used at any stage of testing, except to SSH to the machine and issue local dd commands. Could you elaborate on how you came to this conclusion, so that I can learn for the future?

RE: Performance with ZIL - Added by Linda Kateley over 2 years ago

Sorry, that was from earlier in the thread. Actually, I have like 4 performance threads on the boards right now... trying to keep up :)

The way to test is to run the workload with 1 ZIL drive and then with 2 as a front end to the pool.

linda

RE: Performance with ZIL - Added by Jeff Gibson over 2 years ago

So I've got a question for you, Andrew, and the team. Since the number of IOPS a drive can perform is inversely proportional to its write response time, isn't it true that the only way the ZIL will ever scale is by using devices that ignore the sync request and return faster than they actually write?
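
Back-of-envelope from the numbers above (assuming a single-threaded dsync writer that waits out the ~0.3ms asvc_t shown by iostat for every 8k write):

    1 s / 0.3 ms   =  ~3,300 flushed writes/s    (matches the ~3,000 w/s in iostat)
    3,300 * 8 KB   =  ~26 MB/s                   (matches the ~25-30 MB/s dd results)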
