The Dizwell FLAC Checker

1.0 Introduction

You presumably rip CDs to FLAC files because you care about listening to good quality digital music files: instead of throwing away a significant proportion of the audio signal to create a small MP3 file, you chose to keep the entire audio signal in FLAC format. At least, that's the reason I ripped everything I own to FLAC!

When you care about quality above all else, then, it helps to make sure that your FLAC files don't degrade over time: introduce a little bit of corruption here, a smidgen of bit-rot there… and pretty soon your FLAC file will not be a bit-perfect copy of your original CD.

The FLAC encoder itself already has a means of testing a file for internal corruption: you run the command

flac -t <name of file> 

…and the encoder will decode the file and report anything that goes wrong as it tries to do so.

The only problem with the “-t” test is that it's done on a per-file basis and consumes a relatively high amount of CPU, not to mention time. So I wanted to come up with a way of checking files in bulk and quickly. In particular, if a file has been verified as non-corrupt in the very recent past, it should not be subjected to another bout of CPU- and time-consuming verification.

The Dizwell FLAC Checker (DFC) is the result.

2.0 How DFC works

When the FLAC encoder creates a new FLAC file in the first place, it also computes (and stores) an MD5 hash of the audio signal component of the file. I emphasise the “audio signal” bit there: the MD5 is a hash purely of the music component of a FLAC file, not of any metadata associated with it. You can re-tag and de-tag a FLAC file a zillion times if you like, but the stored MD5 value for the actual sound part of the file will never change.

You can display the MD5 hash for the audio in a file, put there by the FLAC encoder itself, with the command:

metaflac --show-md5sum <name of file>

For example:

$ metaflac --show-md5sum 01\ -\ Introduction.flac
328b55c5b74e5cf10dd21be4d87d6bf6

Now it's also possible to compute a fresh MD5 for a FLAC file using the ffmpeg program, like so:

$ ffmpeg -i "01 - Introduction.flac" -map 0:a -f md5 - 2>/dev/null | sed 's/.*=//'
328b55c5b74e5cf10dd21be4d87d6bf6

When the hash returned by the ffmpeg command (a separate project that decodes FLAC with its own code, developed completely independently of the FLAC encoder) matches the hash value which the FLAC encoder computed at the time this CD was ripped, as you see in this example, then you've got fairly good assurance that the music signal in that file today is exactly the same music signal that existed on the day the CD was ripped.

On the other hand, if a single bit of the audio signal has changed for any reason at all, the two MD5 hashes will not agree… and then you know some change to the signal has taken place over time.
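
The comparison described above can be sketched as a small Bash function. This is an illustration only, not code taken from DFC: the function name audio_hash_matches is hypothetical. It also assumes 16-bit audio such as a CD rip: ffmpeg's md5 output defaults to 16-bit PCM, so a file with a higher bit depth would likely need an extra option before the two hashes could agree.

```shell
# Hypothetical helper, not taken from dfc.sh.
# Returns 0 if the MD5 stored by the FLAC encoder equals a fresh MD5 of the
# decoded audio, 1 if they differ, 2 if either hash could not be obtained.
# Assumption: 16-bit audio (for example, a CD rip).
audio_hash_matches() {
    local file=$1 stored fresh

    # Hash written into the file when it was first encoded.
    stored=$(metaflac --show-md5sum "$file" 2>/dev/null) || return 2

    # Hash of the audio as it decodes today; ffmpeg prints "MD5=<hash>".
    # -nostdin stops ffmpeg swallowing input when called inside a read loop.
    fresh=$(ffmpeg -nostdin -v error -i "$file" -map 0:a -f md5 - 2>/dev/null) || return 2
    fresh=${fresh#MD5=}

    [ -n "$stored" ] && [ -n "$fresh" ] || return 2
    [ "$stored" = "$fresh" ]
}

# Usage:
#   audio_hash_matches "01 - Introduction.flac" && echo "audio unchanged"
```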

This is basically what DFC does for you. For each FLAC file found in a directory structure, it checks to see when the file was last checked for “FLAC integrity”. If it was checked less than 30 days ago, DFC skips doing anything else with the file (thus saving CPU and time!). But if the file was last checked more than 30 days ago, DFC performs a fresh ffmpeg-based calculation of the MD5 hash for the audio in the file. If that matches the one stored in the file by FLAC when the file was first created, we're all good: the audio signal hasn't changed and the file contents must, therefore, be fine. In that case, DFC just updates the date of the last 'health check'.

If the comparison of the new MD5 hash with the original reveals that the hashes are not identical, however, then DFC generates an alert, so you know one of your files is potentially corrupt.
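
The skip-or-check decision could look something like the sketch below. The details are assumptions rather than DFC's own code: the function names are hypothetical, and the sketch assumes that the TAGDATE tag (the tag DFC uses to record the last check, described later on this page) holds the time as seconds since the Unix epoch, which this page does not confirm.

```shell
# Hypothetical sketch; dfc.sh may store and compare dates differently.
MAX_AGE=$((30 * 24 * 60 * 60))   # 30 days, expressed in seconds

# Print the TAGDATE value of a file, or nothing if the tag is absent.
# Assumption: the value is a Unix timestamp (seconds since the epoch).
last_checked() {
    metaflac --show-tag=TAGDATE "$1" 2>/dev/null | sed -n 's/^[^=]*=//p' | head -n 1
}

# Succeed (return 0) if a file is due for a fresh hash comparison.
#   $1 = time of last check (may be empty), $2 = current time, both in seconds
needs_check() {
    local last=$1 now=$2
    case $last in
        ''|*[!0-9]*) return 0 ;;   # never checked, or value not a plain number
    esac
    [ $((now - last)) -ge "$MAX_AGE" ]
}

# Record a successful check by replacing the tag with the current time.
stamp_checked() {
    metaflac --remove-tag=TAGDATE --set-tag="TAGDATE=$(date +%s)" "$1"
}

# Usage:
#   needs_check "$(last_checked track.flac)" "$(date +%s)" && echo "due"
```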

The results of its work are written out to a log file, so you can skip to the end of that to quickly find out how many of your FLACs were skipped (because they were checked not too long ago); validated (i.e., checked and found not to have changed audio signals); or failed (i.e., checked and their new MD5 hash doesn't match the original).
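
The run described above (walk the tree, skip or validate each file, then report totals) might be structured roughly as follows. This is a sketch under assumptions, not the real script: check_tree is a hypothetical name, and the two per-file decisions are passed in as command names so that the loop does not depend on how they are made. The counts follow the definitions in the paragraph above; how dfc.sh itself tallies a failed file may differ.

```shell
# Hypothetical sketch of the outer loop.
#   $1 = root directory        $2 = log file
#   $3 = command that succeeds if a file is due for checking
#   $4 = command that succeeds if a file's audio hash still matches
check_tree() {
    local root=$1 log=$2 is_due=$3 verify=$4
    local validated=0 skipped=0 failed=0 file

    # -print0 with read -d '' keeps names containing spaces intact.
    while IFS= read -r -d '' file; do
        if ! "$is_due" "$file"; then
            skipped=$((skipped + 1))
        elif "$verify" "$file"; then
            validated=$((validated + 1))
        else
            failed=$((failed + 1))
            # Only failures are itemised in the log.
            printf '%s : Stored hash is not current hash!\n' "${file#"$root"}" >>"$log"
        fi
    done < <(find "$root" -type f -name '*.flac' -print0 | sort -z)

    # Show the totals on screen and append them to the log.
    printf 'Validated Files:  %d\nSkipped Files:    %d\nFiles in error:   %d\n' \
        "$validated" "$skipped" "$failed" | tee -a "$log"
}
```

Passing the checks in as commands also makes the loop easy to try out with dummy functions before pointing it at real files.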

DFC won't fix any corruption it finds: that's up to you to do, using whatever tools you have at your disposal (such as re-ripping a CD, restoring from a good backup, or some other approach of your own devising). But DFC will let you know whether your bit-perfect audio collection is becoming a little less bit-perfect with time!

3.0 Obtaining and Running DFC

DFC is supplied as a Bash shell script and can be downloaded from here. Since it is only a shell script, you can open it in a text editor of your choosing and make sure it's not going to do anything untoward.

Once you've downloaded the script (say to your own Downloads directory), you need to make the file executable and (I would recommend) easily runnable. To that end, issue the following commands:

sudo mv /home/hjr/Downloads/dfc.sh /usr/bin
sudo ln -s /usr/bin/dfc.sh /usr/bin/dfc
sudo chmod +x /usr/bin/dfc.sh

The first command moves the download into the /usr/bin directory, so that the file is then in your PATH and can be invoked from anywhere simply by typing its name (rather than having to type its full path and name every time). Instead of running the script by typing /home/hjr/Downloads/dfc.sh, therefore, you can now just type dfc.sh.

The second command creates a symbolic link to the file, using a name that lacks the “.sh” extension. So now you can invoke the script with the simple command dfc.

Only the third command is actually compulsory, though: it's the one that makes the shell script executable and thus runnable.

Once you've made the script executable, it can be run from anywhere in your file system: when you run it, you tell it the 'root' of the folder structure where you store your FLAC music files which you want checked. For example:

dfc "/multimedia/flac/hjr/classical/B/Benjamin Britten"

…which is a good example of how to invoke the program when your directories contain spaces: you wrap the entire directory name inside a pair of double-quotation marks. If you wanted to be less precise, and thus to check more files, you might instead do:

dfc /multimedia/flac/hjr/classical

…which starts 'higher up' in my storage tree hierarchy, and since there are no spaces in the directory names anywhere, no double quotes are needed.

Note that you can optionally specify a second parameter: the directory where the log file for the run should be written. If you leave this out, the log file will be written to your $HOME directory. So, for example:

dfc /multimedia/flac/hjr/classical /home/hjr/logs

…shows two run-time parameters being specified. The first is the root of the FLAC file directory structure, as before; the second is where I want the log file written. Again, if you want the log written somewhere whose name contains spaces or other special characters, wrap that second parameter in double quotes. So, for example:

dfc /multimedia/flac/hjr/classical "/home/hjr/Logs/Flac Checker"
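
Handling of the two parameters might be sketched as below. The function and variable names are hypothetical; the fallback to $HOME and the DFC-<timestamp>.log file name follow what this page shows, but the real script may well go about it differently.

```shell
# Hypothetical sketch of the parameter handling described above.
# Prints the path of the log file that a run would use.
#   $1 = root of the FLAC tree (required), $2 = log directory (optional)
resolve_log_file() {
    local root=${1:-} log_dir=${2:-$HOME}

    if [ -z "$root" ] || [ ! -d "$root" ]; then
        echo "usage: dfc <flac-root> [log-directory]" >&2
        return 1
    fi

    # Quoting the expansions keeps directory names with spaces in one piece.
    # The timestamp mimics names such as DFC-201906071510.log
    printf '%s/DFC-%s.log\n' "$log_dir" "$(date +%Y%m%d%H%M)"
}

# Usage:
#   resolve_log_file /multimedia/flac/hjr/classical "/home/hjr/Logs/Flac Checker"
```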

4.0 Example Output

When you first run the checker, you may see output similar to this:

$ dfc "/multimedia/flac/hjr/classical/B/Benjamin Britten"
-----------------------------------------------------------------------------
          The Dizwell Flac Checker, Copyright © Howard Rogers 2019
                             Version: 1.0
  No log directory specified. Using /home/hjr instead...
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------

  Validating /Ballet/Plymouth Town (Llewellyn)/01 - Plymouth Town.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/01 - Act 1. Prelude.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/02 - The Fool and the Dwarf.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/03 - The Emperor-March.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/04 - Gavotte.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/05 - The Four Kings.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/06 - The King of the North.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/07 - The King of the East.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/08 - The King of the West.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/09 - The King of the South.flac...

…and so on.

“Validating” means the program has noted that the file needs checking (that is, it hasn't previously been checked within the past 30 days or so). It also indicates that DFC is re-computing the MD5 hash of the audio signal in the listed files, and doing the comparison of that new hash value to the old one stored within the FLAC file by the FLAC encoder itself. In short, “validating” means real work is being done.

If you re-run the program regularly, then you may see this sort of output instead:

$ dfc "/multimedia/flac/hjr/classical/B/Benjamin Britten"
-----------------------------------------------------------------------------
          The Dizwell Flac Checker, Copyright © Howard Rogers 2019
                             Version: 1.0
  No log directory specified. Using /home/hjr instead...
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------

  Skipping /Ballet/Plymouth Town (Llewellyn)/01 - Plymouth Town.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/01 - Act 1. Prelude.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/02 - The Fool and the Dwarf.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/03 - The Emperor-March.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/04 - Gavotte.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/05 - The Four Kings.flac... Last analysis done less than 30 days ago!

“Skipping” in this context means that the tool has read the TAGDATE metadata tag in your file, which is where DFC records the date of the last check, and has found that date to be less than 30 days old. Therefore, DFC simply skips further processing for the listed file and moves on to the next; and so on. This saves decoding and testing files which were checked so recently that it's most unlikely that any of their contents have become corrupted since.

(If you ever want to force a re-check, regardless of when it was last done, you can edit line 32 of the shell script. Set the value there to a low number (say, 10) and the test becomes 'was this file checked less than 10 seconds ago'… so DFC will almost certainly decide that the file does now need re-checking.)

If you are very unfortunate, you may see this sort of output from the DFC tool instead:

$ /home/hjr/Scripts/dfc.sh /home/hjr/Music/
-----------------------------------------------------------------------------
          The Dizwell Flac Checker, Copyright © Howard Rogers 2019
                             Version: 1.0
  No log directory specified. Using /home/hjr instead...
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------

  Validating /Eternity's Sunrise (Goodwin)/track01.flac...
   /Eternity's Sunrise (Goodwin)/track01.flac : Stored hash is not current hash!
  Skipping /Songs (Schreier)/42 - Urians Reise um die Welt.flac... Last analysis done less than 30 days ago!
  Skipping /Tavener/01 Track01.flac... Last analysis done less than 30 days ago!
  Skipping /Tavener/02 Track02.flac... Last analysis done less than 30 days ago!
  Skipping /Tavener/03 Track03.flac... Last analysis done less than 30 days ago!
  Skipping /Tavener2/01 - Choir and Orchestra of the Academy of Ancient Music, Paul Goodwin - Eternity's Sunrise.flac... Last analysis done less than 30 days ago!
  Skipping /Vaughan Williams - A Cambridge Mass/01 - Blest pair of Sirens.flac... Last analysis done less than 30 days ago!
  Validated Files:  1
  Skipped Files:    6
  Files in error:   1

  There were errors. Please check the log!

==========================================================================================================

Here, the validation process was started on the Eternity's Sunrise file, but the program has detected that its freshly-computed MD5 hash is not the same as that stored in the file by the FLAC encoder. This means that the 'current hash' is different to the 'stored hash'. It's a sign of 'change', potentially of corruption.

Note the final message is to 'check the log': DFC may check many thousands of files during a single run, so the “real time” error that scrolls up the screen when you run DFC directly would long since have scrolled off the screen into oblivion. No matter: the count summary is shown as the last thing DFC displays, so you can immediately see a count of any files that have errors. But it will be the log that contains the specifics of which files failed the verification check. In the log file for this run, for example, you'd see this:

$ cat DFC-201906071510.log

Starting the Dizwell Flac Checker in /home/hjr/Music/...
Validating files last checked more than 30 days ago
/Eternity's Sunrise (Goodwin)/track01.flac : Stored hash is not current hash!
Validated Files:  1
Skipped Files:    6
Files in error:   1

Note that the log file doesn't record which files were skipped or which passed validation: it only lists the files that failed verification. The log is therefore the 'agenda' of things which need fixing (though how you go about resolving corruption in a FLAC file is beyond the scope of the DFC tool or this page!).
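
Because the log has such a simple layout, pulling the 'agenda' out of it is easy to script. The helper below is a hypothetical sketch that relies only on the log format shown above; if a future version of DFC words its messages differently, the pattern would need adjusting.

```shell
# Hypothetical helper that relies only on the log layout shown above.
# Prints the path of every file the log reports as failed, one per line.
#   $1 = path to a DFC log file
failed_files() {
    sed -n 's/ : Stored hash is not current hash!$//p' "$1"
}

# Usage:
#   failed_files "$HOME/DFC-201906071510.log"
```

The paths printed are relative to the root directory given when DFC was run, so they need that root put back in front before being fed to any other tool.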

5.0 Other Matters

5.1 Rsync/backup Consequences

When DFC finds a file that hasn't been checked in more than 30 days, it will re-verify the file (as you'd hope). It will also modify the content of the TAGDATE metadata tag within the FLAC file, so that it correctly records the newest date at which the verification was performed. This should not be a problem as such, but be aware that it means the file contents have actually changed.

If, for example, you use rsync to copy your music files to other servers or backup disks, the modification of the TAGDATE tag will be enough to make rsync believe the file needs re-copying. Potentially, your entire music collection will therefore be re-copied (resulting in twice the usual disk consumption!)

Of course, a synchronisation of your music collection at this point would not result in excess disk consumption, since the old files on the spare disk/server would be replaced by the new ones from the freshly-validated source. But a mere copy would result in two versions of every file: the original plus the freshly-validated equivalent.

You need to think about the consequences of metadata updates for your backup strategy, in short. In my case, I run the validations against my backup copies, rather than the source. This means the source itself never looks 'new', so no fresh copying is triggered. But I still have the assurance of knowing that my backups of those music files (and, by implication, the original source of them) are corruption-free, which is pretty much as good!
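
One way to see what a DFC run has done to your backup workload, before anything is actually copied, is an rsync dry run. The wrapper below is only a sketch: the function name preview_sync is hypothetical and the paths in the usage line are placeholders for your own.

```shell
# Hypothetical helper: list what rsync would transfer, without copying anything.
#   -a  archive mode (recurse, keep times and permissions)
#   -n  dry run: report only, change nothing
#   -i  itemise each file that would be updated
#   $1 = source directory, $2 = destination directory
preview_sync() {
    local src=${1%/}/ dest=$2   # trailing slash: compare the directory contents
    rsync -a -n -i "$src" "$dest"
}

# Usage:
#   preview_sync /multimedia/flac/hjr/classical /backup/flac/classical
```

Each line of output names a file that rsync considers changed, so a long list straight after a DFC run is the re-copying effect described above.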

5.2 Scheduling

The whole point of TAGDATE is to allow you to re-run DFC frequently without triggering massive amounts of actual work: files which were validated recently will be skipped, after all.

I would therefore personally suggest scheduling DFC to run nightly: for most nights, this will mean it does practically nothing (unless you've added new music to your collection that day, of course). But once a month or so, the nightly run will trigger a re-check of most FLACs. Accordingly, here's my crontab entry for the DFC process:

0 2 * * * /usr/bin/dfc.sh "/multimedia/flac/hjr/classical" "/home/hjr/Logs"

Don't forget the two run-time parameters (source of music files and write-location of log files, respectively).
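
Since nobody is watching the screen at 2am, a small wrapper can make a failed run harder to miss. The sketch below is an assumption-laden illustration, not part of DFC: nightly_check is a hypothetical name, and it assumes dfc is on the PATH and writes DFC-<timestamp>.log files with the summary lines shown earlier on this page.

```shell
# Hypothetical cron wrapper, not part of DFC.
# Runs dfc, then inspects the newest log; returns 1 if any file failed.
#   $1 = root of the FLAC tree, $2 = log directory
nightly_check() {
    local root=$1 log_dir=$2 latest= errors f

    dfc "$root" "$log_dir" >/dev/null 2>&1

    # Globs expand in sorted order and the timestamp in the name sorts
    # chronologically, so the last match is the newest log.
    for f in "$log_dir"/DFC-*.log; do
        [ -e "$f" ] && latest=$f
    done
    [ -n "$latest" ] || { echo "no DFC log found in $log_dir" >&2; return 2; }

    errors=$(awk '/^Files in error:/ { print $NF }' "$latest")
    if [ "${errors:-0}" -gt 0 ]; then
        # cron typically mails any output to the job's owner, if mail is set up.
        echo "DFC reported $errors file(s) in error; see $latest"
        return 1
    fi
}

# Usage (for example from a script called by cron):
#   nightly_check "/multimedia/flac/hjr/classical" "/home/hjr/Logs"
```

On a clean night the wrapper prints nothing at all, so the only runs that produce output are the ones that need attention.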

6.0 Dependencies and Limitations

DFC is intended to be run on Linux systems only. It only works with FLAC files: it does not work with MP3 files, for example (or, indeed, any other audio format you care to mention). It requires the metaflac utility to be present on your system. Metaflac is part of the standard flac package, which many Linux distros install by default and which is otherwise readily installed from their standard repositories.

DFC also requires the pre-existence of the ffmpeg utility (used to calculate MD5 hashes for a file). Again, this is either a standard component of most distros or can be easily installed from a distro's standard repositories.

Finally, DFC is a Bash script… so if Bash isn't already installed, you need to install it before DFC will run. Distros which alias Bash to some other shell may not run it correctly.
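
A quick way to confirm that these dependencies are in place before a first run is sketched below. The function name missing_tools is hypothetical and this check is not part of DFC itself.

```shell
# Hypothetical pre-flight check, not part of DFC.
# Prints the name of each listed tool that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage:
#   missing=$(missing_tools bash metaflac ffmpeg)
#   [ -z "$missing" ] || echo "Please install: $missing" >&2
```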

Author

DFC was devised and written by Howard Rogers ([email protected]).

License

DFC is copyright © Howard Rogers 2019, but is made available freely under the GPL v2.0 only. That license may be downloaded here.

Bug Tracking, Feature Requests, Comments

There is no formal mechanism for reporting and tracking bugs, feature requests or general comments. But you are very welcome to email your comments to [email protected]

wiki/software/dfc.txt · Last modified: 2019/06/12 11:32 by dizwell