In this notebook, I walk you through three different transformations of audio (wav) files for a ten-class classification problem. Since the classifier is a vision-based algorithm, it is easy to inspect the transformed features visually and to see their impact on model performance.

The three different transformation types are:

  • Linear Spectrograms
  • Log Spectrograms
  • Mel Spectrograms

You can learn more about these three transformations in Scott Duda's article and Ketan Doshi's writing, which explain why mel spectrograms generally perform better as visual representations of audio.
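
As a quick aside of my own (not from those articles): the mel scale spaces frequencies the way human pitch perception does, compressing the high end. In the common HTK formulation, mel = 2595 * log10(1 + f/700), which librosa exposes directly:

import numpy as np
import librosa

freqs = np.array([220.0, 2200.0, 4400.0])  # Hz
mels = librosa.hz_to_mel(freqs, htk=True)  # HTK formula: 2595 * log10(1 + f / 700)
print(dict(zip(freqs, mels)))

Per hertz, the low end of the axis gets far more resolution than the high end, which is why mel spectrograms devote more of the image to the perceptually richer low and mid frequencies.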

This notebook tests these three transforms on the UrbanSound8K dataset and compares how each performs with a pre-trained vision model (ResNet-34) using fastai v2: every sound is converted to a spectrogram image, which the fastai code base then classifies.
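
For orientation, the fastai v2 training pattern used later looks roughly like the sketch below; img_df, 'img_path', and 'is_valid' are illustrative placeholders rather than the notebook's actual names.

# hedged sketch of the fastai v2 pattern (illustrative names, not the exact cells)
from fastai.vision.all import *

dls = ImageDataLoaders.from_df(img_df, path='.',
                               fn_col='img_path', label_col='class',
                               valid_col='is_valid', item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=accuracy)  # pre-trained ResNet-34
learn.fine_tune(5)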

The dataset ships in ten folders (folds), and we will treat this as ten-fold cross-validation so the results are directly comparable with other research papers.
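
Concretely, the scheme is leave-one-fold-out: each of the ten predefined folds serves once as the validation set while the other nine are used for training. A minimal sketch of the loop (train_and_score is a hypothetical stand-in for the training step, and df is the metadata DataFrame loaded later):

# leave-one-fold-out over the ten predefined folds (sketch only)
fold_scores = []
for test_fold in range(1, 11):
    train_df = df[df['fold'] != test_fold]
    valid_df = df[df['fold'] == test_fold]
    # fold_scores.append(train_and_score(train_df, valid_df))  # hypothetical helper
# the reported metric is the mean accuracy across the ten folds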

About the UrbanSounds8K dataset

UrbanSound8K is a dataset of 8,732 labeled sound excerpts, each four seconds or less (a quick duration check follows the class list), drawn from these 10 classes:

  1. air_conditioner
  2. car_horn
  3. children_playing
  4. dog_bark
  5. drilling
  6. engine_idling
  7. gun_shot
  8. jackhammer
  9. siren
  10. street_music
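
As a quick sanity check on the "four seconds or less" claim (my own addition; run it after the metadata is loaded below), the start and end columns of the metadata give each clip's duration:

durations = df['end'] - df['start']  # clip length in seconds
print(durations.describe())          # max should be <= 4.0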

As of late 2019, optimized classical ML approaches on this dataset reached about 74% classification accuracy with a k-nearest neighbours (KNN) algorithm, while a deep neural network trained from scratch reached about 76%.

Accuracy metrics

(Figure: accuracy metrics table from the cited research.)

Setup Section

The installs and includes

You could use a non-GPU machine type for the file conversions, since they are CPU-bound (though still time-consuming). For this notebook, I am using an ml.p3.2xlarge; the deep learning model would run just as well on an ml.g4.2xlarge at a reduced cost.

# One-time installs - on AWS, use the conda_pytorch_p38 environment on an ml.p3.2xlarge for this notebook
# !pip install librosa
# !pip install fastbook

# all the one-time imports for this notebook
import pandas as pd

from fastai.vision.all import *
from fastai.data.all import *
import matplotlib.pyplot as plt
from matplotlib.pyplot import specgram
import librosa
import librosa.display
import numpy as np
from pathlib import Path
import os
import random
import IPython
from tqdm import tqdm

from collections import OrderedDict

Download files from source then uncompress

The subsequent steps need adequate disk space for the initial files and the transformations that follow; I set aside 100 GB.
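
If you want to confirm the headroom before downloading (a convenience check of my own, not part of the original steps), the standard library can report free space:

import shutil

total, used, free = shutil.disk_usage('.')
print(f'Free disk space: {free / 2**30:.1f} GiB')  # aim for ~100 GB of headroom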

# One-time download of the dataset files to a local folder
# !wget https://goo.gl/8hY5ER  # download
# !tar xf 8hY5ER  # unpack the tar file

File classification information

Once unpacked, the metadata file below provides the classification information for every clip.

df = pd.read_csv('UrbanSound8K/metadata/UrbanSound8K.csv')  # classification information across folds, as provided by UrbanSound8K
df.head()
slice_file_name fsID start end salience fold classID class
0 100032-3-0-0.wav 100032 0.0 0.317551 1 5 3 dog_bark
1 100263-2-0-117.wav 100263 58.5 62.500000 1 5 2 children_playing
2 100263-2-0-121.wav 100263 60.5 64.500000 1 5 2 children_playing
3 100263-2-0-126.wav 100263 63.0 67.000000 1 5 2 children_playing
4 100263-2-0-137.wav 100263 68.5 72.500000 1 5 2 children_playing

Data exploration

Class distribution across the sound types

df.groupby('class').classID.count().sort_values(ascending=False).plot.bar()
plt.ylabel('count')
plt.title('Class distribution in the dataset')

The class-distribution plot shows that all classes are well represented, with gun_shot and car_horn slightly underrepresented relative to the others.
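
To put exact numbers on the plot (a quick check of my own, not in the original flow):

df['class'].value_counts()  # exact counts behind the bar chart above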

Each of the folds has a relatively similar number of wav files.

df.groupby(['fold']).classID.count().sort_values(ascending=False).plot.bar()
plt.ylabel('Files in each fold')
plt.title('Files in each fold')
Inspect the files - audio and a single transform of an audio file

Listen to a sample wav file - it happens to be a dog barking, an urban sound.

audio_file= 'UrbanSound8K/audio/fold5/100032-3-0-0.wav'   #dog bark in fold 5

IPython.display.Audio(audio_file)

Single File Transformation

Linear Spectrogram

Using the librosa module, these are the steps to convert a single wav file into a simple linear-frequency spectrogram. We stay with the dog-bark audio, now represented as a spectrogram.

samples, sample_rate = librosa.load(audio_file)
Ydb = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=sample_rate)  # np.abs drops phase, avoiding the complex-input warning
plt.figure(figsize=(18, 6))
librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='linear')
plt.colorbar()
Log Spectrogram

The same spectrogram data can also be displayed with a logarithmic frequency axis - here is the dog bark in log space.

plt.figure(figsize=(18, 6))
librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='log')
plt.colorbar()
Mel Spectrogram

And finally, this code transforms the same audio file into a mel spectrogram. Note how much more visible detail and contrast the mel representation carries.

S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
Sdb = librosa.power_to_db(S, ref=np.max)
plt.figure(figsize=(18, 6))
librosa.display.specshow(Sdb, sr=sample_rate, x_axis='time', y_axis='mel')
plt.colorbar()

8K Files into Three Transformations

The code below creates folders for the transformed image files generated from the 8K+ wav files. Note the minor change from the single-file approach: we drop the axes, since the model does not need them - a picture without axes works best.

This step takes a significant amount of time, but it only needs to be done once.

audio_path = Path('UrbanSound8K/audio/')  # unzipped source audio files (wav) live here
tranform_store_path = 'UrbanSoundTransforms/'  # destination folder for each transformation type

#make initial folders once
#os.mkdir(tranform_store_path)
# os.mkdir(tranform_store_path +'linear_spectrogram')
# os.mkdir(tranform_store_path +'log_spectrogram')
# os.mkdir(tranform_store_path +'mel_spectrogram')

# for fold in np.arange (1,11):
#     print(f'Processing fold {fold}')
#     try:
#         os.mkdir(tranform_store_path+'linear_spectrogram/'+ str(fold))
#         os.mkdir(tranform_store_path+'log_spectrogram/'+ str(fold))
#         os.mkdir(tranform_store_path+'mel_spectrogram/'+str(fold))
#     except:
#         pass #Folder exists
#     for audio_file in tqdm(list(Path(audio_path/f'fold{fold}').glob('*.wav'))):
#         samples, sample_rate = librosa.load(audio_file)  # load once with librosa
        
#         #plot for linear spectrogram - without axis, tight 
        
#         fig = plt.figure(figsize=[0.72,0.72])
#         ax = fig.add_subplot(111)
#         ax.axes.get_xaxis().set_visible(False)
#         ax.axes.get_yaxis().set_visible(False)
#         ax.set_frame_on(False)
#         Ydb = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=sample_rate)
#         LS = librosa.display.specshow(Ydb, sr=sample_rate, x_axis='time', y_axis='linear')
#         filename = tranform_store_path + 'linear_spectrogram/' + str(fold) + '/' + audio_file.name.replace('.wav', '.png')
#         plt.savefig(filename, dpi=400, bbox_inches='tight',pad_inches=0)
#         plt.close('all')
        
#         # plot for log spectrogram - without axis, tight
#         fig = plt.figure(figsize=[0.72,0.72])
#         ax = fig.add_subplot(111)
#         ax.axes.get_xaxis().set_visible(False)
#         ax.axes.get_yaxis().set_visible(False)
#         ax.set_frame_on(False)
#         LogS = librosa.display.specshow(Ydb, sr=sample_rate,x_axis='time', y_axis='log')
#         filename = tranform_store_path + 'log_spectrogram/' + str(fold) + '/' + audio_file.name.replace('.wav', '.png')
#         plt.savefig(filename, dpi=400, bbox_inches='tight',pad_inches=0)
#         plt.close('all')
        
#         #plot for mel spectrogram - without axis, tight
        
#         fig = plt.figure(figsize=[0.72,0.72])
#         ax = fig.add_subplot(111)
#         ax.axes.get_xaxis().set_visible(False)
#         ax.axes.get_yaxis().set_visible(False)
#         ax.set_frame_on(False)
#         melS = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
#         librosa.display.specshow(librosa.power_to_db(melS, ref=np.max))
#         filename = tranform_store_path + 'mel_spectrogram/' + str(fold) + '/' + audio_file.name.replace('.wav', '.png')
#         plt.savefig(filename, dpi=400, bbox_inches='tight',pad_inches=0)
#         plt.close('all')
        
Validate all files were transformed into the destination folds

Check that all 8,732 files were converted for each of the three transformation types.

transforms = ['linear_spectrogram/', 'log_spectrogram/', 'mel_spectrogram/']
for transform in transforms:
    count = 0
    for fold in range(1, 11):
        count += len(list(Path(tranform_store_path + transform + str(fold)).glob('*.png')))
    print(f'{transform[:-1]} file count is {count}')
    assert len(df) == count  # every wav file has a matching png
linear_spectrogram file count is 8732
log_spectrogram file count is 8732
mel_spectrogram file count is 8732

classes = OrderedDict(sorted(df.set_index('classID').to_dict()['class'].items()))
classes
OrderedDict([(0, 'air_conditioner'),
             (1, 'car_horn'),
             (2, 'children_playing'),
             (3, 'dog_bark'),
             (4, 'drilling'),
             (5, 'engine_idling'),
             (6, 'gun_shot'),
             (7, 'jackhammer'),
             (8, 'siren'),
             (9, 'street_music')])
Visual inspection for each sound category

In this bit of code, we look at one transformed sample from each of the ten classes, across all three transformations.

fig, ax = plt.subplots(10,3, figsize=(16,16))
for k,v in classes.items():
    sample = df[df['class']==v].sample(1)
    sample_fold = sample['fold'].values[0]
    sample_file = sample['slice_file_name'].values[0].replace('wav','png')
    for t_counter, transform in enumerate(transforms):
        img = plt.imread(tranform_store_path + transform + str(sample_fold) + '/' + sample_file)
        ax[k][t_counter].imshow(img, aspect='equal')
        ax[k][t_counter].set_title(v + ' transformed with ' + transform[:-1])
        ax[k][t_counter].title.set_size(10)
        ax[k][t_counter].set_axis_off()
fig.tight_layout()
plt.show()