## 2025-02-03 Results
### First `syncoid` pass

```
root@orcus:~# time -p syncoid --recursive --no-privilege-elevation send/test recv/test
...
real 3638.23
user 50.31
sys 2515.21
root@orcus:~#
```
Full log of syncoid first pass
### Second pass

Without stirring and as a user:

```
hbarta@orcus:~$ time -p syncoid --recursive --no-privilege-elevation send/test recv/test
...
real 22.10
user 4.05
sys 10.04
hbarta@orcus:~$
```
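Running `syncoid` with `--no-privilege-elevation` as a regular user presumably depends on ZFS permission delegation having been set up beforehand. A minimal sketch, assuming the dataset names above (the exact permission list is an assumption and may need tuning for a given setup):

```sh
# Delegate send-side and receive-side permissions to the test user.
zfs allow hbarta send,snapshot,hold,release,destroy,mount send/test
zfs allow hbarta receive,create,rollback,destroy,mount recv/test
```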
```
hbarta@orcus:~$ time -p stir_pool.sh
...
real 89.62
user 36.15
sys 34.96
hbarta@orcus:~$
```
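`stir_pool.sh` itself isn't reproduced here. Judging from the description (it churns the pool for about 90 seconds and `dd` is responsible for the noisy stderr output noted below), a minimal sketch might look like the following; the paths, file counts, and sizes are illustrative assumptions, not the actual script:

```sh
#!/bin/sh
# Hypothetical stir script: write and then delete files with dd so that
# blocks are freed and reallocated, churning the pool's free space.
set -e
DIR=/send/test/stir
mkdir -p "$DIR"
for i in $(seq 1 64); do
    # dd reports its transfer statistics on stderr, hence the noisy logs.
    dd if=/dev/urandom of="$DIR/f$i" bs=1M count=8
done
rm -f "$DIR"/f*
```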
### Post-stir passes
```
hbarta@orcus:~$ time -p syncoid --recursive --no-privilege-elevation send/test recv/test
...
real 50.18
user 4.37
sys 13.40
hbarta@orcus:~$
```
Stir twice more, first with stdout redirected to a log file and then with stderr redirected as well. (Note: `dd` writes a lot of output to stderr.)

```
time -p stir_pool.sh >log/$(date +%Y-%m-%d-%H%M)_stir.txt
...
real 89.62
user 36.15
sys 34.96
hbarta@orcus:~$
```
```
hbarta@orcus:~$ time -p stir_pool.sh >log/$(date +%Y-%m-%d-%H%M)_stir.txt 2>&1
real 89.30
user 36.89
sys 41.47
hbarta@orcus:~$
```
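One detail worth noting: redirection order matters. `2>&1` duplicates stderr onto whatever stdout currently points at, so it must come after the `>file` redirection to capture both streams in the log:

```sh
stir_pool.sh > stir.log 2>&1   # stdout and stderr both land in stir.log
stir_pool.sh 2>&1 > stir.log   # stderr still goes to the terminal
```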
### Repeat `syncoid`
```
hbarta@orcus:~$ time -p syncoid --recursive --no-privilege-elevation send/test recv/test
...
real 81.88
user 4.38
sys 18.49
hbarta@orcus:~$
```
Observations:

- It takes about 90s to stir the pool.
- `syncoid` takes about 50s following a single stir and 82s following two stir passes.
## 2025-02-04 Errors produced
The first log to report errors is `2025-02-04-1618.syncoid.59.txt`, and it reports:
```
Sending incremental send/test/l0_1/l1_1/l2_1@syncoid_orcus_2025-02-04:16:12:41-GMT-06:00 ... syncoid_orcus_2025-02-04:16:18:43-GMT-06:00 (~ 20.8 MB):
warning: cannot send 'send/test/l0_1/l1_1/l2_1@autosnap_2025-02-04_22:15:50_frequently': Invalid argument
cannot receive incremental stream: most recent snapshot of recv/test/l0_1/l1_1/l2_1 does not match incremental source
CRITICAL ERROR: zfs send -I 'send/test/l0_1/l1_1/l2_1'@'syncoid_orcus_2025-02-04:16:12:41-GMT-06:00' 'send/test/l0_1/l1_1/l2_1'@'syncoid_orcus_2025-02-04:16:18:43-GMT-06:00' | mbuffer -q -s 128k -m 16M 2>/dev/null | pv -p -t -e -r -b -s 21847088 | zfs receive -s -F 'recv/test/l0_1/l1_1/l2_1' 2>&1 failed: 256 at /sbin/syncoid line 817.
Sending incremental send/test/l0_1/l1_1/l2_2@syncoid_orcus_2025-02-04:16:12:43-GMT-06:00 ... syncoid_orcus_2025-02-04:16:18:43-GMT-06:00 (~ 21.8 MB):
```
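A way to dig into the "does not match incremental source" complaint is to compare snapshot GUIDs on the two sides; if the newest snapshot on the receive side has a different `guid` than its namesake on the send side, the incremental stream cannot apply. For example, using the dataset from the error above:

```sh
zfs list -t snapshot -o name,guid -s creation -r send/test/l0_1/l1_1/l2_1
zfs list -t snapshot -o name,guid -s creation -r recv/test/l0_1/l1_1/l2_1
```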
And the full status was:

```
real 59.00
user 4.86
sys 19.78
errors: List of errors unavailable: permission denied
  pool: send
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:16:15 with 0 errors on Tue Feb  4 12:19:17 2025
config:

        NAME                      STATE     READ WRITE CKSUM
        send                      ONLINE       0     0     0
          wwn-0x5002538d40878f8e  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list
```
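The `errors: List of errors unavailable: permission denied` line appears because the status was captured as an unprivileged user; listing the affected files requires elevated privileges, e.g.:

```sh
sudo zpool status -v send
```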
There were a total of 197 `syncoid` runs, and corruption was first noted on the 175th pass. The error count cascaded on subsequent passes.
```
hbarta@orcus:~/logs$ grep "use '-v'" *syncoid*
2025-02-04-1618.syncoid.59.txt:errors: 1 data errors, use '-v' for a list
2025-02-04-1624.syncoid.53.txt:errors: 34 data errors, use '-v' for a list
2025-02-04-1630.syncoid.25.txt:errors: 74 data errors, use '-v' for a list
2025-02-04-1636.syncoid.28.txt:errors: 452 data errors, use '-v' for a list
2025-02-04-1642.syncoid.22.txt:errors: 974 data errors, use '-v' for a list
2025-02-04-1648.syncoid.25.txt:errors: 1130 data errors, use '-v' for a list
2025-02-04-1654.syncoid.25.txt:errors: 1215 data errors, use '-v' for a list
2025-02-04-1700.syncoid.26.txt:errors: 1227 data errors, use '-v' for a list
2025-02-04-1706.syncoid.24.txt:errors: 1229 data errors, use '-v' for a list
2025-02-04-1712.syncoid.25.txt:errors: 1231 data errors, use '-v' for a list
2025-02-04-1718.syncoid.24.txt:errors: 1232 data errors, use '-v' for a list
2025-02-04-1724.syncoid.24.txt:errors: 1233 data errors, use '-v' for a list
2025-02-04-1730.syncoid.23.txt:errors: 1264 data errors, use '-v' for a list
2025-02-04-1736.syncoid.25.txt:errors: 1295 data errors, use '-v' for a list
2025-02-04-1742.syncoid.24.txt:errors: 1295 data errors, use '-v' for a list
2025-02-04-1748.syncoid.25.txt:errors: 1296 data errors, use '-v' for a list
2025-02-04-1754.syncoid.25.txt:errors: 1296 data errors, use '-v' for a list
2025-02-04-1800.syncoid.26.txt:errors: 1297 data errors, use '-v' for a list
hbarta@orcus:~/logs$
```
Log files around the time of the first detected corruption were:
```
-rw-r--r-- 1 hbarta hbarta 499171 Feb  4 16:07 2025-02-04-1607.stir_pools.57.txt
-rw-r--r-- 1 hbarta hbarta   6851 Feb  4 16:12 2025-02-04-1612.syncoid.58.txt
-rw-r--r-- 1 hbarta hbarta 491105 Feb  4 16:14 2025-02-04-1614.stir_pools.58.txt
-rw-r--r-- 1 hbarta hbarta   7762 Feb  4 16:19 2025-02-04-1618.syncoid.59.txt <<< first corruption
-rw-r--r-- 1 hbarta hbarta 495515 Feb  4 16:22 2025-02-04-1621.stir_pools.59.txt
-rw-r--r-- 1 hbarta hbarta  10028 Feb  4 16:24 2025-02-04-1624.syncoid.53.txt
-rw-r--r-- 1 hbarta hbarta 493205 Feb  4 16:29 2025-02-04-1628.stir_pools.59.txt
```
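The log names appear to encode both the start time and the elapsed whole seconds of each run (note `syncoid.59` alongside `real 59.00` above). A hypothetical wrapper along those lines, purely inferred from the naming convention:

```sh
#!/bin/sh
# Hypothetical logging wrapper, inferred from the file names above:
# logs/<start stamp>.<name>.<elapsed seconds>.txt
name=$1; shift
stamp=$(date +%Y-%m-%d-%H%M)
start=$(date +%s)
"$@" >"$HOME/logs/$stamp.$name.txt" 2>&1
elapsed=$(( $(date +%s) - start ))
mv "$HOME/logs/$stamp.$name.txt" "$HOME/logs/$stamp.$name.$elapsed.txt"
```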
Observations:

- The previous stir started at 1614 and lasted about a minute.
- The problem `syncoid` run started at 1618, 4 minutes past the stir.
- The previous scrub completed at 12:19:17 and was the last scrub performed. Scrubs were scheduled 4x daily.
- `sanoid` runs on a fifteen-minute schedule and would have completed well before these operations (including the previous `syncoid`) completed. The closest timing overlap was the stir at 1614 completing 2 seconds before `sanoid` ran. It seems likely that caches might still have been in flight to the HDD (see the sketch following this list).
- Testing began at 2140 the previous day and the first corruption was noted at 1618. This is a vast improvement over "a couple days."
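One way to probe the in-flight-cache hypothesis would be to force a transaction-group sync between the stir and the next `sanoid`/`syncoid` activity; `zpool sync` blocks until all pending dirty data has been written out. This is a suggestion, not something done in the test above:

```sh
# Flush pending writes on both pools before the next snapshot/send cycle.
zpool sync send recv
```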
All operations, including `syncoid`, `sanoid`, stir, and scrub, were halted as soon as corruption was noticed.