
The Dizwell Tag Cleaner

1.0 Introduction

I've been ripping music CDs for a long time: I remember doing it for the first time in the middle of the Australian bush in around 1999. Unfortunately, back then, I decided to rip everything to MP3 (because disks were expensive!) and so I soon had to re-rip everything to FLAC once I realised how much of the true audio signal I'd thrown away in the original ripping exercise. Back then, too, I was primarily a Windows user, so I sometimes ripped to WMA Lossless, which then proved inconvenient when I started using Linux on my main desktop, so all my lossless WMAs got converted to FLAC. And then back again when I wiped whatever Linux was flavour of the month and reverted to Windows… and so on and on.

You get the idea: ripping CDs is one step in a process of converting physical musical media to digital-only. It's the necessary first step, but it's not usually the last one and the tools we use to do any of the steps change over time, as our tastes, skills and experiences evolve. The net result is, probably, a nightmare of tagging history building up in your music files. Mine, certainly, contain a pile of nonsense I don't want and history I don't need… so I wrote a small utility that will salvage the data from the “core” metadata tags (such as Composer, Album name, Recording Date, Performers and so on) and remove anything that's non-core.

The utility is the Dizwell Tag Cleaner and you can download it from here.

2.0 Bad Examples

A common tool to use to rip CDs on Windows, for example, is Exact Audio Copy (EAC). It does a fine bit-perfect rip of CDs, but here are the end results of one sample rip of a single test track from a CD I had handy:

metaflac --list --block-type=VORBIS_COMMENT 01\ Track01.flac 

METADATA block #2
type: 4 (VORBIS_COMMENT)
is last: false
length: 317
vendor string: reference libFLAC 1.3.1 20141125
comments: 13
  comment[0]: ARTIST=CD Artist Here
  comment[1]: TITLE=Track01
  comment[2]: ALBUM=CD Title Here
  comment[3]: DATE=2999
  comment[4]: TRACKNUMBER=01
  comment[5]: GENRE=Some Genre
  comment[6]: COMMENT=Some Comment here
  comment[7]: BAND=CD Performer Here
  comment[8]: ALBUMARTIST=CD Performer Here
  comment[9]: COMPOSER=CD Composer
  comment[10]: DISCNUMBER=1
  comment[11]: TOTALDISCS=1
  comment[12]: TOTALTRACKS=5

So the 'core' data is fine: the ARTIST tag is what you think it should be; the COMPOSER is where EAC prompts you to supply a composer's name and so on. But “BAND”? That's the name of the tag EAC creates, even though the relevant field label in the program itself is actually “CD Performer”. The same field is (redundantly!) used to populate an 'ALBUMARTIST' tag, too. EAC also says it just ripped a track from 'DISCNUMBER 1' -despite the disk number always being irrelevant when ripping Classical music (as I explain here, specifically Section 4.4).

Anyway, you get the point, I hope: the program interface prompts for some bits of data that use names that sometimes don't match the metadata tags it then goes on to create. It also creates some metadata tags which aren't prompted for at all (such as the aforementioned DISCNUMBER, along with TOTALDISCS).

Here's another example, ripping a CD I have of some Tavener choral works on a Linux system using a program called Asunder:

METADATA block #2
type: 4 (VORBIS_COMMENT)
is last: false
length: 232
vendor string: reference libFLAC 1.3.2 20170101
comments: 6
  comment[0]: TRACKNUMBER=1
  comment[1]: ARTIST=Choir and Orchestra of the Academy of Ancient Music, Paul Goodwin
  comment[2]: ALBUM=John Tavener, Eternity's Sunrise
  comment[3]: TITLE=Eternity's Sunrise
  comment[4]: GENRE=Choral
  comment[5]: DATE=1900

As you can see, this program doesn't ask for much …and doesn't write very much either! Everything that's listed here is what I'd call “core”, getting to the precise matter of whose music this is, who's performing it and so on. There are no extraneous matters such as “BAND” or “TOTALDISCS”! Unfortunately, there's also no 'Composer' tag; and there's no general 'Comments' tag, which we generally use to record the performers' details. So: Asunder does a neat job, but an incomplete one… but at least it's not storing the kitchen sink in the metadata!

And as yet another example, here's the metadata from a track I remember ripping a few years back on a Windows PC:

METADATA block #1
type: 4 (VORBIS_COMMENT)
is last: false
length: 739
vendor string: reference libFLAC 1.3.1 20141125
comments: 21
  comment[0]: Title=Allegro moderato
  comment[1]: Artist=Karl Goldmark
  comment[2]: Comment=Gerard Schwarz, Seattle Symphony Orchestra, Nai-Yuan Hu (violin)
  comment[3]: Date=1995
  comment[4]: Composer=Karl Goldmark
  comment[5]: Label=Delos
  comment[6]: UPC=013491315621
  comment[7]: Album=Violin Concerto No. 1 (Hu)
  comment[8]: Genre=Concerto
  comment[9]: ALBUMARTIST=Karl Goldmark
  comment[10]: Conductor=Jack Vartoogian
  comment[11]: Orchestra=Seattle Symphony Orchestra
  comment[12]: Soloists=Nai-Yuan Hu
  comment[13]: Catalog #=3156
  comment[14]: TRACKNUMBER=1
  comment[15]: AccurateRipResult=AccurateRip: Accurate (confidence 3)   [8A7486EE]
  comment[16]: AccurateRipDiscID=006-000fde35-0052a4f0-440e2506-1
  comment[17]: Source=CD (Lossless) >> Perfect (Lossless) [wma]
  comment[18]: Encoded By=dBpoweramp Release 15.2
  comment[19]: Encoder=(FLAC 1.3.1)
  comment[20]: Encoder Settings=-compression-level-8 -verify

The core data here is fine: it's what I chose to specify as Title, Artist, Comment, Composer, Album and Track number, after all. But the Label? The Conductor? The Source? I didn't put those bits of metadata there: the dBpoweramp CD ripper for Windows (an excellent one, by the way) put them there for me, no doubt having sourced much of the information from the mostly-useless CDDB online database of CDs (and can I point out that the Conductor tag is completely wrong, anyway!)

In a way, that last example is quite touching: the tag at index 17 (called “Source”) shows me that I originally ripped the CD on a Windows box, because I ripped it to Lossless WMA. It's a little bit of my history, buried away inside this file's metadata!

But this is also a good example of what's wrong with the way a lot of CD rippers “presume too much”! The “label” is irrelevant to me: sure, the physical CD I ripped might have been produced by 'Delos', but since the CD is now buried in a box under a pile of carpet in the loft space and I only listen to its digital-file cousin, the label is of no meaning. The fact that it's Catalog 3156 is also useless to me. I don't even really care about the fact that this was originally a Lossless WMA file that's been transcoded to FLAC: both are lossless formats, the audio signal will be bit-perfect in either. And so on: there's a mountain of information in this track's metadata which is of no use to me, or is simply wrong, or has no relevance to a piece of digital music.

In summary, then: different ripping tools populate different amounts of metadata in a track's tags. Some of it is 'core', some of it definitely isn't. And the Dizwell Tag Cleaner will save the core and ditch the non-core.

3.0 A Good Example

So here's what that last music track's metadata looks like once the Dizwell Tag Cleaner (dtc) has been let loose on it:

METADATA block #1
type: 4 (VORBIS_COMMENT)
is last: false
length: 388
vendor string: reference libFLAC 1.3.1 20141125
comments: 11
  comment[0]: COMPOSER=Karl Goldmark
  comment[1]: ARTIST=Karl Goldmark
  comment[2]: ALBUM=Violin Concerto No. 1 (Hu)
  comment[3]: TRACKNUMBER=1
  comment[4]: TRACKTOTAL=3
  comment[5]: TITLE=Allegro moderato
  comment[6]: GENRE=Concerto
  comment[7]: COMMENT=Gerard Schwarz, Seattle Symphony Orchestra, Nai-Yuan Hu (violin)
  comment[8]: DATE=1995
  comment[9]: ENCODED-BY=Dizwell Tag Cleaner © 2019 Howard Rogers
  comment[10]: TAGDATE=1559812376

Gone are such tags as Album Artist, Label, Conductor, Catalog #, UPC, Source and so on. Instead, there are just 9 'core' tags, along with two 'extras': ENCODED-BY and TAGDATE. Anything which was stored in a tag that isn't part of the 9 core tags seen here is permanently lost. You will therefore lose details about whether you ripped the track on Windows, to WMA, to FLAC using compression level 8 or whether the CD was supplied by Decca or Deutsche Grammophon. My contention is, however, that this is fine: you didn't really need that data anyway!
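To make the 'keep the core, ditch the rest' idea concrete, here is a minimal Bash sketch of that kind of filter. It is not lifted from the DTC script: the function name filter_core_tags is made up for illustration, and the list of core tags is simply read off the cleaned example above. It takes NAME=value lines on standard input, upper-cases the tag name (Vorbis comment names are case-insensitive) and prints only the core ones. It doesn't attempt the other things DTC does, such as copying the composer into the ARTIST tag, and it assumes one tag per line.

```bash
#!/usr/bin/env bash
# Sketch only: keep just the "core" tags from NAME=value lines on stdin.
# filter_core_tags is a hypothetical helper, not a function from DTC itself.

filter_core_tags() {
  local line name value
  # The "|| [[ -n $line ]]" part copes with a final line that lacks a newline.
  while IFS= read -r line || [[ -n $line ]]; do
    # Skip anything that is not in NAME=value form.
    [[ $line == *=* ]] || continue
    name=${line%%=*}     # text before the first "="
    value=${line#*=}     # text after the first "="
    name=${name^^}       # normalise the tag name to upper case (Bash 4+)
    case $name in
      COMPOSER|ARTIST|ALBUM|TRACKNUMBER|TRACKTOTAL|TITLE|GENRE|COMMENT|DATE)
        printf '%s=%s\n' "$name" "$value" ;;
    esac
  done
}
```

You could feed it real tags with something like metaflac --export-tags-to=- "some file.flac" | filter_core_tags, which only prints the result and changes nothing in the file.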

Note that DTC does not do anything to Album Art metadata you may have stored within your FLAC files. If there was album art before, there will be album art afterwards. If there was no album art, DTC doesn't magic it into existence.

DTC doesn't ever invent other metadata either. If it wasn't stored as a tag for an audio file in the first place, DTC won't magically create it for you (or fetch it from Internet sources, which are always rubbish as far as Classical music is concerned!).

If you already had an 'Encoded By' tag, the 'Dizwell Tag Cleaner' will be appended to it, rather than replacing it. You may well, therefore, see a 'compound' value here, such as “Classical CD Ripper; Dizwell Tag Cleaner” or similar.

Note that DTC removes any “Album Artist” or “Original Artist” tag data that might be present and only stores the Composer name in the Composer and Artist tags. This is in line with this site's concepts of how best to tag Classical music! (See, in particular, Section 4.3 of that other article).

The new ENCODED-BY tag is there simply to let you know that your tags have been subject to 'cleaning'. The TAGDATE is actually the number of seconds since January 1st 1970, so it tells you (with a bit of computational effort!) precisely when your tags were cleaned. Since we validate the audio data as part of the cleaning process (see Section 4 below), this is also the date when your audio data was last known to be 'good'. You can then subsequently arrange for future checks of 'goodness' to be conducted based on how old your file has become since it was last checked (for example, “if the TAGDATE is from more than 3 months ago, then re-check for corruption”). The TAGDATE is not therefore useful in itself; but other tools could use it as the basis of periodic health checks of your audio files, which would be very useful!
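The 'bit of computational effort' can be handed to the date command. The following is a small sketch, assuming GNU date (the norm on Linux); the two function names are hypothetical, and treating “3 months” as 90 days is my own simplification.

```bash
#!/usr/bin/env bash
# Sketch only: interpret a TAGDATE value (seconds since 1970-01-01 00:00:00 UTC).
# tagdate_to_utc and tagdate_is_stale are hypothetical helper names.

# Print the TAGDATE as a human-readable UTC timestamp (needs GNU date).
tagdate_to_utc() {
  date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Succeed (exit status 0) if the TAGDATE is more than max_days old.
# Arguments: tagdate, max_days and, optionally, "now" in epoch seconds.
tagdate_is_stale() {
  local tagdate=$1 max_days=$2 now=${3:-$(date +%s)}
  (( now - tagdate > max_days * 86400 ))
}

# The TAGDATE from the cleaned example above:
tagdate_to_utc 1559812376    # prints 2019-06-06 09:12:56
```

To read the value from a real file, metaflac --show-tag=TAGDATE "some file.flac" prints a line of the form TAGDATE=1559812376, so you would strip everything up to the “=” before passing it to tagdate_is_stale (for example with a 90-day limit).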

4.0 Data Validation

For any piece of input data, an MD5 hash value can be computed from it. The specific value of that hash isn't really very important; it's the fact that if you vary the input data by so much as a single bit, then the hash value will change substantially.

In other words, if you have a hash value of “X” on January 1st, and then compute a fresh hash value of “X” on December 31st, you can say quite categorically that the data hasn't altered at all in the course of the year. That's to say, we don't care about the specific value of an MD5 hash; we simply care that it doesn't change over time. If it does, then that's a sign that the audio signal in your music file has been altered -by corruption or bit-rot, for example. Knowing that, you could then re-rip your CD or restore the 'bad' audio file from a known-good backup copy.

So, as part of its re-tagging procedures, DTC computes a completely new MD5 value for the music component of the audio file, using ffmpeg. It then compares this freshly-computed value with that stored within the file's metadata by the FLAC encoder itself when the file was first created. If the two values agree, then fine: the music component of the file hasn't altered at all between creation and re-tagging. If they don't agree, then DTC will warn you that the two hashes aren't the same: that should be regarded as a warning that internal corruption within the file has been detected. How you fix that is up to you, of course: but at least you will know that something is amiss!
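If you want to try the same comparison by hand, the sketch below shows one way of doing it with metaflac and ffmpeg. It is an illustration of the technique, not a copy of what the DTC script runs, and the function names are made up. One assumption matters a lot: ffmpeg's md5 output hashes 16-bit PCM by default, so this only matches the stored value for 16-bit (CD-sourced) FLACs; files with a different bit depth would need a matching -c:a pcm_… option adding.

```bash
#!/usr/bin/env bash
# Sketch only: compare the MD5 stored in a FLAC file with a freshly computed one.
# Function names are hypothetical. Assumes 16-bit audio (see the note above).

# The hash the FLAC encoder stored when the file was created.
stored_md5() {
  metaflac --show-md5sum "$1"
}

# A hash freshly computed from the decoded audio stream only (no album art).
fresh_md5() {
  local out
  # "-f md5 -" writes one line of the form "MD5=<hash>" to standard output.
  out=$(ffmpeg -v error -nostdin -i "$1" -map 0:a:0 -f md5 -) || return 1
  printf '%s\n' "${out#MD5=}"
}

# Exit status 0 if the two hashes agree, 1 if they differ, 2 if a tool failed.
audio_is_intact() {
  local stored fresh
  stored=$(stored_md5 "$1") || return 2
  fresh=$(fresh_md5 "$1") || return 2
  [[ -n $stored && $stored == "$fresh" ]]
}
```

You might then use it as audio_is_intact "03 - Rosa Mystica.flac" || echo "hashes differ". Bear in mind that an all-zero value from metaflac usually means the encoder never stored a hash in the first place, in which case a mismatch tells you nothing about corruption.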

Once a corruption of this sort has been detected, DTC will abort further tagging operations. You will see output like this, for example:

-----------------------------------------------------------------------------
          The Dizwell Tag Cleaner, Copyright © Howard Rogers 2019
                           Version: 1.0
-----------------------------------------------------------------------------

  Processing file...  01 - Heaven-Haven.flac
  Processing file...  02 - O deus, ego amo te.flac
  Processing file...  03 - Rosa Mystica.flac
  
  Warning: MD5 hashes don't agree for 03 - Rosa Mystica.flac: 
  Its audio signal is possibly corrupted!
  Aborting the tag cleaning process at this point.

------------------------------------------------------------------------------

This may leave the files in a folder in an inconsistent state: say there are four files, representing the four movements in a symphony. You may well end up with files 1 and 2 'cleaned', because their audio component is fine. But then a corruption in file 3 is detected, so DTC doesn't alter file 3's tags at all and simply stops. This would then leave both files 3 and 4 with the old tags, whilst files 1 and 2 have new ones. This behaviour is by design, of course: having inconsistent tagging between files is the least of your problems when you've got the audio signal corrupted in one of them!

5.0 Obtaining and Running the Software

The DTC software can be downloaded from here. It is simply a Bash shell script, so you are free to open it in a text editor of your choice and make sure it's not going to do anything untoward or unexpected.

Once you've downloaded the script (say, to a Downloads folder), I would do the following:

sudo mv /home/hjr/Downloads/dtc.sh /usr/bin
sudo ln -s /usr/bin/dtc.sh /usr/bin/dtc
sudo chmod +x /usr/bin/dtc.sh

These commands move the script to a directory which will be in your user's PATH -that means it can then be invoked just by typing “dtc.sh” rather than having to specify the full path and filename. The second command creates a symlink so that you can invoke the program just by typing “dtc” rather than “dtc.sh”. The third command is the only one that's actually compulsory: it makes the shell script executable and capable of being invoked at all.

The program can be run from anywhere in your file system: when you run it, you tell it the 'root' of the folder structure where you store your FLAC music files.

So, for example:

$ cd
$ dtc /multimedia/flac/hjr/classical

-----------------------------------------------------------------------------
          The Dizwell Tag Cleaner, Copyright © Howard Rogers 2019
                            Version: 1.01
-----------------------------------------------------------------------------

  Processing file...   /Ballet/Appalachian Spring (Copland)/01 - Very slow.flac
  Processing file...   /Ballet/Appalachian Spring (Copland)/02 - Fast.flac
  Processing file...   /Ballet/Appalachian Spring (Copland)/03 - Moderato.flac
  Already cleaned, so skipping file...  /Ballet/Appalachian Spring (Copland)/04 - Fast.flac
  
  All files processed.

-----------------------------------------------------------------------------

The program permanently alters the metadata content of your audio files, so you should start by doing just one or two 'Album's-worth' at a time until you are comfortable with what it does. (A good backup of the files you're about to alter might be an idea, too!)

In my case, I might start by running it using the command:

dtc "/multimedia/flac/hjr/classical/A/Aaron Copland"

…which makes DTC think that a specific composer's folder is now the 'root' of all music files, and therefore stops it from going off to clean every composer's music files.

That last example is significant, too, in that it shows you how to cope with a path to your music folders which contains space(s): you stick the path in double-quotes. You can also just escape the spaces with a preceding “\”, so this would work just as well:

dtc /multimedia/flac/hjr/classical/A/Aaron\ Copland

But don't use double-quotes and escape characters together, because inside double quotes the “\” character is treated literally and becomes part of the path!
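If you'd like to see the difference for yourself, this little demonstration prints exactly what a program receives in each of the three cases. The show_arg function is just a stand-in I've invented for the purpose; it has nothing to do with DTC itself.

```bash
#!/usr/bin/env bash
# Sketch only: show what a program actually receives for each quoting style.
# show_arg is a hypothetical helper that prints its first argument in brackets.

show_arg() {
  printf '[%s]\n' "$1"
}

show_arg "/multimedia/flac/hjr/classical/A/Aaron Copland"     # double quotes
show_arg /multimedia/flac/hjr/classical/A/Aaron\ Copland       # escaped space
show_arg "/multimedia/flac/hjr/classical/A/Aaron\ Copland"    # both: backslash is kept
```

The first two calls print the same path, with a plain space in it. The third prints the path with the backslash still in it, which names a folder that doesn't exist.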

You can always check what the program has done by issuing the metaflac command I showed you before:

metaflac --list --block-type=VORBIS_COMMENT <name-and-path-of-flac-file-here>

Make sure your music player of choice continues to sort, group and display your music library files in an appropriate way after you've cleaned files, in the same way as it did before. I've checked the results in Foobar2000, Clementine, DeaDBeeF and Windows Media Player and am confident the cleaning process doesn't alter the way those programs display their music libraries. However, that's far from an exhaustive list of media players/organisers, so if you use something different, check how it responds to the cleaning before going full steam ahead and cleaning your entire audio library.

Finally, note that once a FLAC file has been tagged as having been cleaned with the Dizwell Tag Cleaner, it will be skipped if the DTC tool is ever run against that file again. (An example of that happening is shown for the fourth file listed in the example above). Should you ever want to re-clean your tags for a file that has been marked as already having been cleaned, you need to un-set the ENCODED-BY tag for that file. You could, for example, do that with the command:

metaflac --remove-tag=ENCODED-BY <name of FLAC file here>

Once the tag has been removed from a file in this way, a subsequent run of DTC will cause the file's tags to be cleaned afresh.
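Before removing the tag, you may want to know whether a file carries the cleaner's mark at all. Here's a hedged sketch of such a check. The is_cleaned function name is invented, and I'm assuming that looking for the text 'Dizwell Tag Cleaner' in the ENCODED-BY tag is a fair approximation of the test DTC applies; the script's own test may differ in detail.

```bash
#!/usr/bin/env bash
# Sketch only: report whether a FLAC file appears to have been cleaned already.
# is_cleaned is a hypothetical helper; DTC's own test may differ in detail.

is_cleaned() {
  # --show-tag prints every matching tag as NAME=value, one per line.
  metaflac --show-tag=ENCODED-BY "$1" | grep -q 'Dizwell Tag Cleaner'
}
```

For example, is_cleaned "01 - Very slow.flac" && echo "already cleaned" would tell you that DTC is likely to skip that file.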

6.0 Dependencies and Limitations

DTC is intended to be run on Linux systems only. It will only work when applied to FLAC files: it does not work with MP3 files, for example (or, indeed, any other audio format you care to mention). It requires that the metaflac utility already exists on your system. Metaflac is part of the standard flac package which is often a standard component of many Linux distros or is readily installed from their standard repositories if not already installed.

DTC also requires the pre-existence of the ffmpeg utility (used to calculate MD5 hashes for a file). Again, this is either a standard component of most distros or can be easily installed from a distro's standard repositories.

Finally, DTC is a Bash script …so if you haven't installed Bash (or it's not installed already), you should install that before it will run correctly. Distros which alias Bash to some other shell may not run it correctly.
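A quick way to confirm that all three dependencies are present is to ask the shell whether it can find them. This is only a convenience sketch of my own (missing_commands is a made-up name), not something the DTC script provides.

```bash
#!/usr/bin/env bash
# Sketch only: report which of the named commands cannot be found in PATH.
# missing_commands is a hypothetical helper name.

missing_commands() {
  local cmd status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf 'Missing: %s\n' "$cmd"
      status=1
    fi
  done
  return "$status"
}
```

Running missing_commands bash metaflac ffmpeg prints nothing and returns success when everything is installed; otherwise it names each missing tool so that you can install it from your distro's repositories.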

Author

DTC was devised and written by Howard Rogers.

License

DTC is copyright © Howard Rogers 2019, but is made available freely under the GPL v2.0 only. That license may be downloaded here.

Bug Tracking, Feature Requests, Comments

There is no formal mechanism for reporting and tracking bugs, feature requests or general comments. But you are very welcome to email your comments to the author.

wiki/software/dtc.txt · Last modified: 2019/06/12 11:32 by dizwell