The text feels AI-generated and is unreadable to me.
The article encourages a fix for "lost disk writes" without investigating or explaining why writes are lost.
Generally these are writes lost after the kernel has confirmed the data from an application's write(). The causes can include hardware glitches; the difficulty lies in alerting an application that has already moved on, assuming all is good.
This is in the realm of Dan Luu's fsyncgate: https://danluu.com/fsyncgate/ (discussed here: https://news.ycombinator.com/item?id=19126824 ) and "Crash Consistency": https://danluu.com/file-consistency/
The article is about "circling back" and re-checking data blocks against separately stored checksums, rather than relying on hardware checksums, which can match even when the data block is "old" (not updated when expected).
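A minimal sketch of that "circle back" idea, with checksums held outside the data path itself (the block size, the in-memory checksum dict, and the function names are all my own for illustration, not anything from the article):

```python
import hashlib
import tempfile

BLOCK_SIZE = 4096  # assumed block size, for illustration only

# Checksums stored separately from the data blocks, so a write the
# hardware "accepted" can still be cross-checked later.
checksums = {}

def write_block(f, block_no, data):
    assert len(data) == BLOCK_SIZE
    f.seek(block_no * BLOCK_SIZE)
    f.write(data)
    f.flush()
    # Record the expected checksum out-of-band.
    checksums[block_no] = hashlib.sha256(data).hexdigest()

def scrub_block(f, block_no):
    """Circle back later: re-read the block and compare against the
    separately stored checksum. A mismatch means a lost or stale write."""
    f.seek(block_no * BLOCK_SIZE)
    data = f.read(BLOCK_SIZE)
    return hashlib.sha256(data).hexdigest() == checksums[block_no]

with tempfile.TemporaryFile("w+b") as f:
    write_block(f, 0, b"a" * BLOCK_SIZE)
    ok_before = scrub_block(f, 0)
    # Simulate a lost write: the application's next update never reaches
    # the block, but the checksum was updated as if it had.
    checksums[0] = hashlib.sha256(b"b" * BLOCK_SIZE).hexdigest()
    ok_after = scrub_block(f, 0)
```

The point is that the scrub catches the "old but internally valid" block, which a per-block hardware checksum would happily pass.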
The "why" of how such things happen is a long and turgid exposition into "sometimes shit happens" .. and that's a book or three by itself.
The issue in the enterprise (SAN) storage space is that there are so many layers where things can go wrong: lost writes due to the OS kernel (as in fsyncgate), in-kernel storage drivers, the storage array software itself, then disk firmware, etc. Theoretically you could read back the just-written block and check it's what you wrote, but the read may be returned from some cache sitting above the layer where the bug happened.
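That read-back caveat can be sketched as below; the function name is mine, and the comment spells out why a successful comparison proves little (without O_DIRECT or similar, the read is typically served from the page cache, not the media):

```python
import os
import tempfile

def write_and_verify(path, offset, data):
    """Write, fsync, then read back and compare. Caveat from the
    comment above: the read may be satisfied by a cache above the
    buggy layer, so a match does NOT prove the block hit the media."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, data, offset)
        os.fsync(fd)  # ask the kernel to push it down the stack
        readback = os.pread(fd, len(data), offset)
        # Likely True even if a lower layer (array, firmware) lost it.
        return readback == data
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "block.dat")
matched = write_and_verify(path, 0, b"x" * 512)
```

On Linux you would need O_DIRECT (with its alignment requirements) to even get past the page cache, and that still says nothing about caches in the array or the drive.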
Another scenario (OS or driver bugs) is that a correct block is written to the wrong location. Yes, the write is persisted, but it overwrites the wrong location. So now you have two incorrect blocks: the intended location still holds stale data, and the wrong location has been clobbered.
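As a toy illustration of that misdirected-write scenario (the byte-array "disk" and the checksum table are made up for the example), per-location checksums flag both damaged blocks, even though the misplaced block's contents are internally valid:

```python
import hashlib

BLOCK = 4096
disk = bytearray(BLOCK * 2)  # a tiny "disk" with two blocks: 0 and 1
expected = {}                # per-location checksums, stored separately

def record(block_no, data):
    expected[block_no] = hashlib.sha256(data).hexdigest()

def check(block_no):
    data = bytes(disk[block_no * BLOCK:(block_no + 1) * BLOCK])
    return hashlib.sha256(data).hexdigest() == expected[block_no]

# Initial state: both blocks written correctly.
for n, byte in ((0, b"A"), (1, b"B")):
    disk[n * BLOCK:(n + 1) * BLOCK] = byte * BLOCK
    record(n, byte * BLOCK)

# Buggy layer: new data intended for block 0 lands on block 1.
new_data = b"C" * BLOCK
record(0, new_data)                    # app believes block 0 was updated
disk[1 * BLOCK:2 * BLOCK] = new_data   # ...but it clobbered block 1

block0_ok, block1_ok = check(0), check(1)  # both come back False
```

A checksum embedded in the block itself would not catch this: the misplaced block checksums fine, it's just in the wrong place. Tying the checksum to the location is what makes both errors visible.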
I think only disk manufacturers can answer that but they're not willing to. Lost or "high-flying" writes have been documented for 20 years or so; we don't need to know why they occur to admit that they are real.