Compare model performance with three different transformation types

A cross-validated approach to verify the impact of a feature on a model

In this notebook, I walk you through three different transformation types for audio (wav) files for a ten-class classification problem.  In this example, I am using a vision-based algorithm; hence it is easy to visualize the importance of features from a visual perspective and their impact on model performance.

The three different transformation types are:

  • Linear Spectrograms
  • Log Spectrograms
  • Mel Spectrograms

You can learn more about these three transformations in Scott Duda's article and Ketan Doshi's writing, which explain why Mel Spectrograms generally perform better for visual transformations of audio files.

This notebook tests these three transforms on the Urban Sounds 8K dataset and compares how they perform with a pre-trained vision-based model (Resnet-34) using fastai v2.  It converts the sounds to spectrograms, then uses the fastai v2 code base to classify them.

The dataset ships with ten predefined folders (folds), and we treat them as a ten-fold cross-validation so that the metric is comparable with other research papers.
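
The fold-based evaluation can be sketched in plain Python. This is a minimal illustration, not the notebook's code: `leave_one_fold_out` and the toy `records` list are hypothetical, and each record is assumed to be a `(file_name, fold)` pair taken from the dataset's metadata.

```python
from collections import defaultdict

def leave_one_fold_out(records):
    """Yield (test_fold, train_files, test_files) for each predefined fold.

    records: iterable of (file_name, fold) pairs (hypothetical input shape).
    """
    by_fold = defaultdict(list)
    for file_name, fold in records:
        by_fold[fold].append(file_name)
    for test_fold in sorted(by_fold):
        # Train on every fold except the one held out for testing.
        train = [f for fold in sorted(by_fold) if fold != test_fold
                 for f in by_fold[fold]]
        yield test_fold, train, list(by_fold[test_fold])

# Toy data: three folds with two files each (made up for illustration).
records = [(f"clip{fold}_{i}.wav", fold) for fold in (1, 2, 3) for i in range(2)]
splits = list(leave_one_fold_out(records))
for test_fold, train, test in splits:
    print(test_fold, len(train), len(test))
```

Keeping the predefined folds intact, rather than reshuffling the files, is what keeps the result comparable with other work on the dataset.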

About the UrbanSounds8K dataset

Urban Sounds is a dataset of 8732 labeled sounds of up to 4 seconds each. The UrbanSounds8K dataset contains these 10 classes:

  1. air_conditioner
  2. car_horn
  3. children_playing
  4. dog_bark
  5. drilling
  6. engine_idling
  7. gun_shot
  8. jackhammer
  9. siren
  10. street_music

As of late 2019, research on this dataset with optimized ML approaches reported a classification accuracy of 74% with a k-nearest neighbours (KNN) algorithm. A deep learning neural network trained from scratch obtained 76% accuracy.

Accuracy metrics

(accuracy metrics for research article)

Setup Section

The installs and includes

The file conversions are computationally expensive, but you could run them on a non-GPU machine type.  For this, I am using an ml.p3.2xlarge.  For the deep learning model, you could run it just as well on an ml.g4.2xlarge at a reduced cost.

Download files from source then uncompress

These subsequent steps ensure adequate space to get the initial files and following transformations.  I set aside 100GB.

File classification information

Note that this file provides classification information once unpacked.

    slice_file_name     fsID    start  end        salience  fold  classID  class
0   100032-3-0-0.wav    100032  0.0    0.317551   1         5     3        dog_bark
1   100263-2-0-117.wav  100263  58.5   62.500000  1         5     2        children_playing
2   100263-2-0-121.wav  100263  60.5   64.500000  1         5     2        children_playing
3   100263-2-0-126.wav  100263  63.0   67.000000  1         5     2        children_playing
4   100263-2-0-137.wav  100263  68.5   72.500000  1         5     2        children_playing
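
The slice file names appear to encode part of this metadata. In the sample rows, the first field matches `fsID` and the second matches `classID`. The meaning of the last two fields (occurrence and slice index) is an assumption based on the dataset's naming convention, and `parse_slice_name` is a hypothetical helper, not part of the notebook.

```python
from typing import NamedTuple

class SliceName(NamedTuple):
    fs_id: int
    class_id: int
    occurrence_id: int  # assumed meaning of the third field
    slice_id: int       # assumed meaning of the fourth field

def parse_slice_name(file_name: str) -> SliceName:
    """Split a name such as '100032-3-0-0.wav' into its four numeric fields."""
    stem = file_name.rsplit(".", 1)[0]          # drop the extension
    fs_id, class_id, occurrence_id, slice_id = (int(p) for p in stem.split("-"))
    return SliceName(fs_id, class_id, occurrence_id, slice_id)

parsed = parse_slice_name("100032-3-0-0.wav")
print(parsed.fs_id, parsed.class_id)
```

A check like this is a cheap way to confirm that the metadata file and the file names agree before training.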

Data exploration

Class distribution across the sound types
plt.title('Class distribution in the dataset')

The classification file shows that all classes are represented, with gun shots and car horns being slightly underrepresented relative to the others.

Each of the folds has a relatively similar number of wav files.

plt.ylabel('Files in each fold')
plt.title('Files in each fold')

Inspect the files - audio and single transform of an audio file

Listen to a sample wav file - it happens to be a dog barking, an urban sound.


audio_file= 'UrbanSound8K/audio/fold5/100032-3-0-0.wav'   #dog bark in fold 5



Single File Transformation

Linear Spectrogram

With the librosa module, these are the steps to convert a single wave file to a simple spectrogram.  We stay with the dog barking audio representation as a spectrogram.


samples, sample_rate = librosa.load(audio_file)
# take the magnitude first: amplitude_to_db discards phase and warns on complex input
Ydb = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=sample_rate)
plt.figure(figsize=(18, 6))
librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='linear')
plt.colorbar()

Log Spectrogram

The same spectrogram of the dog bark can also be displayed with a logarithmic frequency axis.


plt.figure(figsize=(18, 6))
librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='log')
plt.colorbar()

Mel Spectrogram

And finally, this code transforms the same audio file into a mel-spectrogram.  Note the added level of intensity and representation in a mel-spectrogram.
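
For intuition on what the mel axis does, one common formula (the HTK variant) maps a frequency f in Hz to mels as 2595 * log10(1 + f / 700). The sketch below is for illustration only. As far as I know, librosa's default mel filter bank uses a different (Slaney-style) variant unless `htk=True` is passed, so these numbers are not necessarily the ones behind the plot in this notebook. The helper names are hypothetical.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """HTK-style Hz -> mel conversion (an assumption, see lead-in)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

Equal steps in Hz take up less and less room on the mel axis as frequency rises, which roughly mirrors how people perceive pitch.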


S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
Sdb = librosa.power_to_db(S, ref=np.max)
plt.figure(figsize=(18, 6))
librosa.display.specshow(Sdb, sr=sample_rate, x_axis='time', y_axis='mel')
plt.colorbar()

8K Files into Three Transformations

The below code creates folders for the transformed image files from the 8K wav files.  Note the minor change from the single-file approach for images - we can drop the axes as machines don't need these - a picture without axes works best.

This step takes a significant amount of time; it needs to be done once.


audio_path = Path('UrbanSound8K/audio/')  # unzipped source audio files are in this location as wav files
tranform_store_path = 'UrbanSoundTransforms/'  # destination folder for each transformed image type

#make initial folders once
# os.mkdir(tranform_store_path +'linear_spectrogram')
# os.mkdir(tranform_store_path +'log_spectrogram')
# os.mkdir(tranform_store_path +'mel_spectrogram')
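
The conversion loop itself is not shown here, so the following is only a sketch of how one file could be turned into an axis-free mel-spectrogram image. It reuses the librosa calls from the single-file example. The helper names `png_path_for` and `save_mel_png`, the figure size, and the save options are assumptions, while the folder layout (`<transform>/<fold number>/<name>.png`) follows the paths used elsewhere in this notebook.

```python
from pathlib import Path

def png_path_for(wav_path, out_root, transform):
    """Map .../foldN/name.wav to out_root/transform/N/name.png."""
    wav_path = Path(wav_path)
    fold = wav_path.parent.name.replace("fold", "")  # 'fold5' -> '5'
    return Path(out_root) / transform / fold / (wav_path.stem + ".png")

def save_mel_png(wav_path, png_path):
    """Render one wav file as a mel-spectrogram image without axes."""
    # Imported lazily so the path helper above works without the audio stack.
    import numpy as np
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt

    samples, sample_rate = librosa.load(wav_path)
    S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
    Sdb = librosa.power_to_db(S, ref=np.max)

    fig, ax = plt.subplots(figsize=(4, 4))  # image size is an arbitrary choice
    librosa.display.specshow(Sdb, sr=sample_rate, ax=ax)
    ax.set_axis_off()                       # the model does not need axes
    png_path = Path(png_path)
    png_path.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)                          # free memory when looping over 8K files

target = png_path_for("UrbanSound8K/audio/fold5/100032-3-0-0.wav",
                      "UrbanSoundTransforms", "mel_spectrogram")
print(target)
```

Calling `save_mel_png(wav, png_path_for(wav, ...))` for every wav file under `audio_path` would fill one transform folder. The linear and log variants would differ only in the spectrogram computation and the `y_axis` scaling.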
Validate all files are transformed in destination folds

Validate that all 8K files converted into the three types of transformations

transforms = ['linear_spectrogram/','log_spectrogram/','mel_spectrogram/']
for transform in transforms:
    count = 0
    for fold in np.arange (1,11):
        count += len(list(Path(tranform_store_path+transform+str(fold)).glob('*.png')))
    print ('%s file count is %s'%(transform[:-1],count))
    assert (len(df)==count)
linear_spectrogram file count is 8732
log_spectrogram file count is 8732
mel_spectrogram file count is 8732

classes = OrderedDict(sorted(df.set_index('classID').to_dict()['class'].items()))
OrderedDict([(0, 'air_conditioner'),
             (1, 'car_horn'),
             (2, 'children_playing'),
             (3, 'dog_bark'),
             (4, 'drilling'),
             (5, 'engine_idling'),
             (6, 'gun_shot'),
             (7, 'jackhammer'),
             (8, 'siren'),
             (9, 'street_music')])
Visual inspection for each sound category

In this bit of code, we look at each of the transformed depictions across the ten classes.

fig, ax = plt.subplots(10,3, figsize=(16,16))
for k,v in classes.items():
    sample = df[df['class']==v].sample(1)
    sample_fold = sample['fold'].values[0]
    sample_file = sample['slice_file_name'].values[0].replace('wav','png')
    # t_counter is the column index: one column per transform
    for t_counter, transform in enumerate(transforms):
        img = plt.imread(tranform_store_path+transform+str(sample_fold)+'/'+sample_file)
        ax[k][t_counter].imshow(img, aspect='equal')
        ax[k][t_counter].set_title(v+' transformed with '+ transform[:-1])

Fast AI Model Build

The reference file lists our sources in wav format, so we change each file name to png and create a dictionary object that maps a filename to its class.

df['fname'] = df['slice_file_name'].str[:-4] + '.png'  # swap the .wav extension for .png

This labelling function uses the dictionary object and returns the class.  We drop the parts of the path and focus on the filename, which is unique across the 8K files.

my_dict = dict(zip(df.fname,df['class']))
def label_func(f_name):
    f_name = str(f_name).split('/')[-1:][0]
    return my_dict[f_name]
all_folds = list(np.arange(1,11))
all_folders = [str(i) for i in all_folds]
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
results = pd.DataFrame()

The code above creates a list of the ten folds as strings for use downstream.  A Resnet-34 model runs for three epochs for comparative analysis.  The loop below runs a k-fold style evaluation, holding out a single fold as the test set, across the three file transformation folders.  This code takes a significant amount of time and uses the GPU.

for transform in transforms:
    all_files = get_image_files(path=tranform_store_path+transform, recurse=True, folders=all_folders)
    for test_folder in all_folds:
        # the original splitter was cut off; holding out the files whose parent
        # folder is the test fold is a reconstruction
        dblock = DataBlock(blocks=(ImageBlock,CategoryBlock),
                   get_y     = label_func,
                   splitter  = FuncSplitter(lambda s: Path(s).parent.name == str(test_folder)))
        dl = dblock.dataloaders(all_files)
        print ('Train has {0} images and test has {1} images. Test is on folder {2} of transform type {3}.' .format(len(dl.train_ds),len(dl.valid_ds),test_folder,transform[:-1]))
        learn = vision_learner(dl, resnet34, metrics=accuracy)
        learn.fine_tune(3)    # three epochs per the text; the exact training call is an assumption
        r = learn.validate()  # [validation loss, accuracy]
        results.loc[test_folder,transform[:-1]] = r[1]

Prediction Results

fold  linear_spectrogram  log_spectrogram  mel_spectrogram
1     0.735395            0.746850         0.750286
2     0.698198            0.730856         0.724099
3     0.678919            0.684324         0.722162
4     0.763636            0.763636         0.772727
5     0.791667            0.795940         0.887821
6     0.712029            0.765492         0.755772
7     0.717184            0.762530         0.779236
8     0.676179            0.719603         0.771712
9     0.789216            0.794118         0.833333
10    0.832736            0.835125         0.824373

       linear_spectrogram  log_spectrogram  mel_spectrogram
count  10.000000           10.000000        10.000000
mean   0.739516            0.759847         0.782152
std    0.052835            0.042857         0.052126
min    0.676179            0.684324         0.722162
25%    0.701656            0.734854         0.751658
50%    0.726289            0.763083         0.772220
75%    0.782821            0.786961         0.813089
max    0.832736            0.835125         0.887821

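On average, the mel-spectrogram outperforms the other two transformations (mean accuracy 0.782, versus 0.760 for log and 0.740 for linear).  The exceptions are folds 2, 6, and 10, where the log spectrogram scores slightly higher; in fold 10 the linear spectrogram also edges it out.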
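
A quick way to check the per-fold comparison is to count the folds in which one transformation beats another. The sketch below uses only the accuracies from the results table; `folds_won` is a hypothetical helper.

```python
from statistics import mean

# Per-fold accuracies copied from the results table (folds 1..10).
linear = [0.735395, 0.698198, 0.678919, 0.763636, 0.791667,
          0.712029, 0.717184, 0.676179, 0.789216, 0.832736]
log = [0.746850, 0.730856, 0.684324, 0.763636, 0.795940,
       0.765492, 0.762530, 0.719603, 0.794118, 0.835125]
mel = [0.750286, 0.724099, 0.722162, 0.772727, 0.887821,
       0.755772, 0.779236, 0.771712, 0.833333, 0.824373]

def folds_won(a, b):
    """Return the 1-based fold numbers where a scores strictly higher than b."""
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x > y]

print("mean mel - mean log   :", round(mean(mel) - mean(log), 4))
print("mean mel - mean linear:", round(mean(mel) - mean(linear), 4))
print("folds where log beats mel   :", folds_won(log, mel))
print("folds where linear beats mel:", folds_won(linear, mel))
```

With only ten folds and a per-fold standard deviation of about 0.05, differences of a point or two between transformations are best read as indicative rather than conclusive.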

We can inspect the largest losses through the fastai library to spot classification issues or sounds that plausibly belong to multiple classes, and relabel them.  But since this is a provided dataset, we will not change any of the categories.


interp = ClassificationInterpretation.from_learner(learn)

losses,idxs = interp.top_losses()

interp.plot_top_losses(9, figsize=(15,11))

interp.plot_confusion_matrix(figsize=(12,12), dpi=60)


  • Mel spectrograms tend to outperform linear and log spectrograms for human audio sounds.
  • We didn't explore further tuning the model or trying other models, but a mean accuracy in the 90% range should be attainable for this dataset.
  • The 95% results in an earlier notebook are attributable to randomness from how the train-test split worked out on this model.

You can try out the mel-spectrogram model trained with a random split from this Hugging Face link or find out how I made this.