Compare model performance with three different transformation types
A cross-validated approach to verifying the impact of a feature transformation on model performance
- About the UrbanSounds8K dataset
- Setup Section
- Data exploration
- Single File Transformation
- 8K Files into Three Transformations
- Fast AI Model Build
- Prediction Results
- Summary
In this notebook, I walk you through three different transformation types for audio (wav) files in a ten-class classification problem. Because the model is vision-based, it is easy to see how each transformation represents the audio visually and how that representation affects model performance.
The three different transformation types are:
- Linear Spectrograms
- Log Spectrograms
- Mel Spectrograms
You can learn more about these three transformations in Scott Duda's article and Ketan Doshi's writing, which explain why mel spectrograms generally work better as visual representations of audio files.
This notebook tests these three transforms on the UrbanSound8K dataset and compares how they perform with a pre-trained vision model (ResNet-34) using fastai v2: each sound is converted to a spectrogram image, and the fastai code base is then used to classify the images. The code and approach are contained in this notebook.
The dataset ships with ten predefined folds (one folder per fold), and we treat this as ten-fold cross-validation so the results can be compared properly with other research papers.
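As a rough sketch of what this looks like in code (not the actual training loop, which appears in the model build section later), leave-one-fold-out cross-validation over the predefined folds amounts to:
# Sketch only: leave-one-fold-out over the ten predefined UrbanSound8K folds.
# df is the metadata DataFrame loaded in the setup section below.
# for valid_fold in range(1, 11):
#     train_rows = df[df['fold'] != valid_fold]  # nine folds for training
#     valid_rows = df[df['fold'] == valid_fold]  # held-out fold for validation
#     # ...train on train_rows, record accuracy on valid_rows, then average over the ten folds...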
UrbanSound8K is a dataset of 8,732 labeled sound excerpts, each four seconds or less, drawn from these 10 classes:
- air_conditioner
- car_horn
- children_playing
- dog_bark
- drilling
- engine_idling
- gun_shot
- jackhammer
- siren
- street_music
As of late 2019, optimized classical ML approaches on this dataset reached about 74% classification accuracy using a k-nearest neighbours (KNN) algorithm, and a deep learning neural network trained from scratch reached about 76%.
(Accuracy metrics reported in the referenced research article.)
The file conversions are computationally expensive but do not need a GPU, so a non-GPU machine type would work for that step. For this notebook I am using an ml.p3.2xlarge; the deep learning model would run just as well on an ml.g4dn.2xlarge at a reduced cost.
# One-time installs - on AWS, use the conda_pytorch_p38 environment with an ml.p3.2xlarge instance for this notebook
# !pip install librosa
# !pip install fastbook
# All the one-time imports for this notebook
import pandas as pd
from fastai.vision.all import *
from fastai.data.all import *
import matplotlib.pyplot as plt
from matplotlib.pyplot import specgram
import librosa
import librosa.display
import numpy as np
from pathlib import Path
import os
import random
import IPython
from tqdm import tqdm
from collections import OrderedDict
The subsequent steps need adequate disk space for the initial files and the transformed images that follow; I set aside 100 GB.
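As a quick optional check (not part of the original workflow), the available disk space can be confirmed before downloading:
# Optional: confirm there is enough free disk space before downloading
# and generating the transformed images (roughly 100 GB set aside here).
import shutil
print('Free disk space: %.1f GB' % (shutil.disk_usage('.').free / 1e9))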
# One time download of the dataset files to the local working folder
# !wget https://goo.gl/8hY5ER #download
# !tar xf 8hY5ER #unpack tar file
Once unpacked, the dataset includes a metadata CSV that provides the class label for each wav file.
df = pd.read_csv('UrbanSound8K/metadata/UrbanSound8K.csv') #classification information across folds as provided from Urbansounds
df.head()
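The columns this notebook relies on later are slice_file_name, fold, classID, and class; a quick optional check that they are present:
# Optional check that the metadata columns used later in this notebook exist.
expected_cols = {'slice_file_name', 'fold', 'classID', 'class'}
assert expected_cols.issubset(df.columns), df.columns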
df.groupby('class').classID.count().sort_values(ascending=False).plot.bar()
plt.ylabel('count')
plt.title('Class distribution in the dataset')
The class distribution shows that all classes are reasonably well represented, with gun_shot and car_horn somewhat underrepresented relative to the others.
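To see the exact counts behind the bar chart:
df['class'].value_counts() # exact number of clips per class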
Each of the folds has a relatively similar number of wav files.
df.groupby(['fold']).classID.count().sort_values(ascending=False).plot.bar()
plt.ylabel('Files in each fold')
plt.title('Files in each fold')
Listen to a sample wav file - it happens to be a dog barking, an urban sound.
audio_file= 'UrbanSound8K/audio/fold5/100032-3-0-0.wav' #dog bark in fold 5
IPython.display.Audio(audio_file)
With the librosa module, these are the steps to convert a single wav file to a simple (linear) spectrogram; we stay with the dog-bark clip and view it as a spectrogram.
samples, sample_rate = librosa.load(audio_file)
Ydb = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=sample_rate)
plt.figure(figsize=(18, 6))
librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='linear')
plt.colorbar()
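As a quick sanity check on what was loaded (librosa.load resamples to 22,050 Hz by default), the clip's duration follows from the number of samples:
# librosa.load resamples to 22,050 Hz by default; duration = samples / sample rate.
print('%d samples at %d Hz = %.2f seconds' % (len(samples), sample_rate, len(samples) / sample_rate))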
The same spectrogram can also be displayed with a logarithmic frequency axis, putting the dog bark in log space.
plt.figure(figsize=(18, 6))
librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='log')
plt.colorbar()
And finally, this code transforms the same audio file into a mel spectrogram. Note how much more intensity and structure is visible in the mel representation.
S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
Sdb = librosa.power_to_db(S, ref=np.max)
plt.figure(figsize=(18, 6))
librosa.display.specshow(Sdb, sr=sample_rate, x_axis='time', y_axis='mel')
plt.colorbar()
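One way to see what the mel transform does is to compare array shapes: with librosa's defaults, the STFT spectrogram has n_fft/2 + 1 = 1025 frequency bins, while the mel spectrogram collapses them into 128 mel bands.
# Compare the frequency resolution of the two representations.
print('STFT spectrogram shape (freq bins, frames):', Ydb.shape)
print('Mel spectrogram shape (mel bands, frames): ', Sdb.shape)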
The code below creates the folders and writes the transformed image files for all of the wav files. Note the minor change from the single-file approach: the axes are dropped, since the model does not need them - an image without axes works best.
This step takes a significant amount of time, but it only needs to be done once.
audio_path = Path('UrbanSound8K/audio/') # unzipped source audio files (wav) are in this location
tranform_store_path = 'UrbanSoundTransforms/' #destination folder for each transformed image state
#make initial folders once
#os.mkdir(tranform_store_path)
# os.mkdir(tranform_store_path +'linear_spectrogram')
# os.mkdir(tranform_store_path +'log_spectrogram')
# os.mkdir(tranform_store_path +'mel_spectrogram')
# for fold in np.arange(1, 11):
#     print(f'Processing fold {fold}')
#     try:
#         os.mkdir(tranform_store_path + 'linear_spectrogram/' + str(fold))
#         os.mkdir(tranform_store_path + 'log_spectrogram/' + str(fold))
#         os.mkdir(tranform_store_path + 'mel_spectrogram/' + str(fold))
#     except:
#         pass  # folder already exists
#     for audio_file in tqdm(list(Path(audio_path/f'fold{fold}').glob('*.wav'))):
#         samples, sample_rate = librosa.load(audio_file)  # load each wav once with librosa
#         # plot the linear spectrogram - without axes, tight layout
#         fig = plt.figure(figsize=[0.72, 0.72])
#         ax = fig.add_subplot(111)
#         ax.axes.get_xaxis().set_visible(False)
#         ax.axes.get_yaxis().set_visible(False)
#         ax.set_frame_on(False)
#         Ydb = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=sample_rate)
#         librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='linear')
#         filename = tranform_store_path + 'linear_spectrogram/' + str(fold) + '/' + audio_file.name.replace('.wav', '.png')
#         plt.savefig(filename, dpi=400, bbox_inches='tight', pad_inches=0)
#         plt.close('all')
#         # plot the log spectrogram - without axes, tight layout
#         fig = plt.figure(figsize=[0.72, 0.72])
#         ax = fig.add_subplot(111)
#         ax.axes.get_xaxis().set_visible(False)
#         ax.axes.get_yaxis().set_visible(False)
#         ax.set_frame_on(False)
#         librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='log')
#         filename = tranform_store_path + 'log_spectrogram/' + str(fold) + '/' + audio_file.name.replace('.wav', '.png')
#         plt.savefig(filename, dpi=400, bbox_inches='tight', pad_inches=0)
#         plt.close('all')
#         # plot the mel spectrogram - without axes, tight layout
#         fig = plt.figure(figsize=[0.72, 0.72])
#         ax = fig.add_subplot(111)
#         ax.axes.get_xaxis().set_visible(False)
#         ax.axes.get_yaxis().set_visible(False)
#         ax.set_frame_on(False)
#         melS = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
#         librosa.display.specshow(librosa.power_to_db(melS, ref=np.max))
#         filename = tranform_store_path + 'mel_spectrogram/' + str(fold) + '/' + audio_file.name.replace('.wav', '.png')
#         plt.savefig(filename, dpi=400, bbox_inches='tight', pad_inches=0)
#         plt.close('all')
Validate that all 8,732 files were converted into each of the three transformation types.
transforms = ['linear_spectrogram/', 'log_spectrogram/', 'mel_spectrogram/']
for transform in transforms:
    count = 0
    for fold in np.arange(1, 11):
        count += len(list(Path(tranform_store_path + transform + str(fold)).glob('*.png')))
    print('%s file count is %s' % (transform[:-1], count))
    assert (len(df) == count)
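Beyond the file counts, it can also help to peek at the pixel dimensions of one saved image; the exact size depends on the figsize and dpi settings used above, and the mel_spectrogram/1 path below is just one example folder.
# Optional: inspect the pixel size of a single saved image.
from PIL import Image
sample_png = next(Path(tranform_store_path + 'mel_spectrogram/1').glob('*.png'))
print(Image.open(sample_png).size)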
classes = OrderedDict(sorted(df.set_index('classID').to_dict()['class'].items()))
classes
In this bit of code, we look at one randomly sampled example from each of the ten classes, rendered with all three transformations.
fig, ax = plt.subplots(10, 3, figsize=(16, 16))
for k, v in classes.items():
    sample = df[df['class'] == v].sample(1)
    sample_fold = sample['fold'].values[0]
    sample_file = sample['slice_file_name'].values[0].replace('wav', 'png')
    t_counter = 0
    for transform in transforms:
        img = plt.imread(tranform_store_path + transform + str(sample_fold) + '/' + sample_file)
        ax[k][t_counter].imshow(img, aspect='equal')
        ax[k][t_counter].set_title(v + ' transformed with ' + transform[:-1])
        ax[k][t_counter].title.set_size(10)
        ax[k][t_counter].set_axis_off()
        t_counter += 1
fig.tight_layout()
plt.show()