
 
Most of the file systems in use today employ journaling to ensure file system consistency. This involves writing either metadata alone, or both metadata and data, to a journal prior to committing changes to the file system itself. In the failure scenario described previously, the journal can be “replayed” in an attempt either to finish committing data to disk or, at least, to bring the disk back to a previous consistent state, with a higher probability of success. Such a safety mechanism is not free, nor does it completely avert risk. Ultimately, the heavier the use of journaling (i.e. for both metadata and data), the lower the risk of unrecoverable inconsistency, at the expense of performance.
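As a rough illustration of this write-ahead ordering, the sketch below logs each change to a journal and forces it to stable storage before touching the data area, then replays the journal after a crash. The file names, record format, and helper functions are invented for the example; real journals operate on raw block devices rather than JSON lines.

```python
import json
import os

JOURNAL = "journal.log"   # hypothetical journal file
DATAFILE = "data.img"     # hypothetical stand-in for the file system proper

def _open_rw(path):
    # Open for update, creating the file if it does not yet exist.
    return open(path, "r+b") if os.path.exists(path) else open(path, "w+b")

def journalled_write(offset, payload):
    # 1. Record the intended change in the journal and force it to stable
    #    storage *before* the file system itself is touched.
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"offset": offset, "payload": payload}) + "\n")
        j.flush()
        os.fsync(j.fileno())
    # 2. Only then commit the change to the data area.
    with _open_rw(DATAFILE) as d:
        d.seek(offset)
        d.write(payload.encode())
        d.flush()
        os.fsync(d.fileno())

def replay_journal():
    # After a crash, re-apply every journalled record; changes that were
    # already committed are simply rewritten, restoring consistency.
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as j, _open_rw(DATAFILE) as d:
        for line in j:
            record = json.loads(line)
            d.seek(record["offset"])
            d.write(record["payload"].encode())
        d.flush()
        os.fsync(d.fileno())
```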
As mentioned previously, ZFS is a CoW file system; it never overwrites data in place, and its transactions are atomic. As a result, the on-disk format is always consistent, hence the lack of an fsck tool for ZFS.
The closest equivalent to journaling in ZFS is the ZIL. However, the two function completely differently: in traditional file systems, data held in RAM is typically flushed to a journal, which is then read when its contents are to be committed to the file system. As a gross oversimplification of the behaviour of ZFS, the ZIL is only ever read to replay transactions following a failure, with data still being read from RAM when committed to disk. It is possible to place the ZIL on a dedicated VDEV, called a SLOG, though there are some important considerations to be made.
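The difference in read paths can be caricatured with the toy classes below (invented names, with plain lists standing in for devices): a traditional journal is read back on every commit, whereas the intent log is written but only read during post-failure replay, with commits served from RAM. This models behaviour only; real ZFS commits whole transaction groups from memory.

```python
class TraditionalJournal:
    """Data is flushed to the journal, which is READ on every commit."""
    def __init__(self):
        self.journal = []                  # models the on-disk journal

    def write(self, record):
        self.journal.append(record)        # record lands in the journal

    def commit(self, disk):
        for record in self.journal:        # journal is read back...
            disk.append(record)            # ...to commit to the file system
        self.journal.clear()


class ZfsIntentLog:
    """The intent log is only read to replay transactions after a failure."""
    def __init__(self):
        self.zil = []                      # models the ZIL (or a SLOG VDEV)
        self.ram = []                      # pending writes held in RAM

    def write(self, record):
        self.zil.append(record)            # logged for crash safety...
        self.ram.append(record)            # ...but also retained in memory

    def commit(self, disk):
        for record in self.ram:            # committed straight from RAM;
            disk.append(record)            # the log is never read here
        self.ram.clear()
        self.zil.clear()                   # log records can now be discarded

    def replay(self, disk):
        for record in self.zil:            # the only code path that reads
            disk.append(record)            # the log: post-failure replay
        self.zil.clear()
```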
A.4  Silent Corruption 
Silent corruption refers to the corruption of data that goes undetected by the normal operation of a system and is, in some cases, unresolvable with certainty. It is often assumed that server-grade hardware is all but immune to such errors, given error-correcting code (ECC) system memory on top of the common ECC and/or cyclic redundancy check (CRC) capabilities of various components and buses within the storage subsystem. In reality, however, this is far from the case. In 2007, Panzer-Steindel at CERN released a study which revealed the following errors under various conditions and tests (though the sampled configurations are not described):
Disk Errors. Approximately 50 single-bit errors and 50 sector-sized regions of corrupted data, over a period of five weeks of activity across 3000 systems.
RAID-5 Verification. Recalculation of parity; approximately 300 block problem fixes across 492 systems over four weeks.
CASTOR Data Pool Checksum Verification. Approximately “one bad file in 1500 files” in 8.7 TB of data, with an estimated “byte error rate of 3 × 10⁻⁷”.
Conventional RAID and file system combinations have no means of resolving the aforementioned errors. In a RAID-1 mirror, the array cannot determine which copy of the data is correct, only that there is a mismatch. A parity array is arguably even worse in this situation: a consistency check would reveal mismatching parity blocks based on parity recalculated from the corrupt data.
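A minimal sketch of why this is so, assuming a two-way mirror and XOR-based parity (function names invented for the example): without an independently stored checksum, a mismatch can be detected but not arbitrated.

```python
from functools import reduce

def xor_blocks(blocks):
    # Bytewise XOR of equally sized blocks, as used for RAID-5 parity.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def check_mirror(copy_a, copy_b):
    # A RAID-1 scrub can detect that the two copies differ, but has no
    # independent information with which to pick the good one.
    return "consistent" if copy_a == copy_b else "mismatch, winner unknown"

def check_parity(data_blocks, stored_parity):
    # A RAID-5 verification recomputes parity from the (possibly already
    # corrupt) data blocks; "repairing" the stored parity to match simply
    # bakes the corruption in.
    recalculated = xor_blocks(data_blocks)
    return ("consistent" if recalculated == stored_parity
            else "parity mismatch, corrupt block unknown")
```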
In this instance, CASTOR (CERN Advanced STORage manager) and its checksumming capability, coupled with data replication, is the only method that can counter silent corruption: if the checksum of a file does not match on verification, the file is deemed corrupt and can be rewritten from the replica. There are two disadvantages to this approach: at the time of the report's publication, the validation process did not run in real time; and it operates at the file level, meaning that reading a large file to calculate its checksum, and rewriting the whole file from a replica if an error is discovered, is expensive in terms of disk activity, as well as CPU time at a large enough scale.
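The cost of this file-level approach can be sketched as follows (hypothetical paths, with SHA-256 standing in for CASTOR's actual checksum algorithm): the entire file must be read to verify it, and the entire file must be rewritten from a replica on a mismatch.

```python
import hashlib
import shutil

def verify_file(path, expected_digest, replica_path):
    # File-level verification: the whole file has to be read to compute
    # the digest, however small the corrupt region actually is.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() == expected_digest:
        return "ok"
    # Repair is equally coarse: the whole file is rewritten from a replica.
    shutil.copyfile(replica_path, path)
    return "repaired from replica"
```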
As stated in A.2, ZFS's on-disk structure is a Merkle tree, which stores the checksums of data blocks in their parent nodes. As with CASTOR, it is possible to run a scrub operation to verify these checksums. However, ZFS also verifies the checksum of a block each time it is read and, if a redundant copy exists, automatically repairs that block alone, as opposed to an entire file.
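By contrast, a block-level sketch of this behaviour might look as follows (SHA-256 standing in for ZFS's configurable checksums, and plain lists standing in for block pointers and redundant copies): verification happens on every read, and repair touches only the failing block.

```python
import hashlib

def read_block(blocks, copies, parent_checksums, index):
    # Each read verifies the block against the checksum held by its
    # parent; on a mismatch only this block is healed from a redundant
    # copy, rather than an entire file being rewritten.
    data = blocks[index]
    if hashlib.sha256(data).hexdigest() == parent_checksums[index]:
        return data
    replacement = copies[index]
    if hashlib.sha256(replacement).hexdigest() != parent_checksums[index]:
        raise IOError("no valid copy of block %d available" % index)
    blocks[index] = replacement        # self-healing: fix just this block
    return replacement
```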
All the aforementioned points apply to both metadata and data. A crucial difference between a conventional file system combined with RAID and ZFS is that these copies, known as ditto blocks, can exist anywhere within a zpool (allowing for some data-level resiliency even on a single disk), and can have up to three instances. As a worst case, ZFS tries to ensure that ditto blocks are placed at least 1/8 of a disk apart. Metadata ditto blocks are mandatory, with ZFS increasing the replication count higher up the tree (these blocks have a greater number of children and are thus more critical to consistency).
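The placement constraint mentioned above can be sketched roughly as follows (an invented helper; the real ZFS allocator is considerably more involved and also weighs free space and device load):

```python
def ditto_offsets(first_offset, disk_size, copies):
    # Keep each redundant copy of a block at least 1/8 of the disk away
    # from the previous one, so a localised media failure is unlikely to
    # take out every copy. The copy count is capped at three, as for
    # ZFS ditto blocks.
    copies = min(copies, 3)
    gap = disk_size // 8
    return [(first_offset + i * gap) % disk_size for i in range(copies)]
```

For example, ditto_offsets(0, 10**12, 3) spaces the three copies of a block 125 GB apart on a 1 TB disk.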
Another form of silent corruption associated with traditional RAID arrays is the “write hole”: the same type of occurrence as outlined above, but triggered by power failure. In production this is rare, due to the use of uninterruptible power supplies (UPSs) to prevent system power loss and RAID controllers