Urban Sounds 8k Classification with Hugging Face Demo
First tutorial using the fastpages Jupyter notebook blog: deploying a fastai model to Hugging Face
Here is a demo of this model on Hugging Face: Urban Sound 8K Classification
UrbanSound8K is a dataset of 8,732 labeled sounds of less than 4 seconds each, drawn from these 10 classes:
- air_conditioner
- car_horn
- children_playing
- dog_bark
- drilling
- engine_idling
- gun_shot
- jackhammer
- siren
- street_music
As of late 2019, research with this dataset using optimized classical ML approaches reached 74% classification accuracy with a k-nearest neighbours (KNN) algorithm, while a deep learning neural network trained from scratch reached 76%.
(Accuracy metrics from the referenced research article.)
State-of-the-art methods for audio classification treat this problem as an image classification task. To turn audio samples into images, three common transformations are:
- Linear Spectrograms
- Log Spectrograms
- Mel Spectrograms
You can learn more about these three transformations in Scott Duda's article and Ketan Doshi's writing, which explain why Mel Spectrograms generally perform better as visual transformations of audio files.
The transformation of these audio files is done in another notebook, which I will add a write-up for here. You can find all of the associated code here.
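Until that write-up is added, here is a minimal sketch of the kind of transformation involved, using librosa to turn one WAV clip into a mel spectrogram PNG. The function name and paths are illustrative, not the exact code from that notebook:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def wav_to_mel_png(wav_path, out_path):
    # Load the clip at its native sample rate
    y, sr = librosa.load(wav_path, sr=None)
    # Mel-scaled spectrogram, converted from power to decibels
    mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
    # Save just the spectrogram image: no axes, no padding
    fig, ax = plt.subplots()
    ax.axis('off')
    librosa.display.specshow(mel_db, sr=sr, ax=ax)
    fig.savefig(out_path, bbox_inches='tight', pad_inches=0)
    plt.close(fig)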
# !pip install librosa
# !pip install fastbook
# !pip install gradio
import pandas as pd
from fastai.vision.all import *
from fastai.data.all import *
import matplotlib.pyplot as plt
from matplotlib.pyplot import specgram
import librosa
import librosa.display
import numpy as np
from pathlib import Path
import os
import random
import IPython
from tqdm import tqdm
from sklearn.metrics import accuracy_score
import gradio as gr
from collections import OrderedDict
df = pd.read_csv('UrbanSound8K/metadata/UrbanSound8K.csv')  # classification information across folds, as provided by UrbanSound8K
df['fname'] = df['slice_file_name'].apply(lambda x: str(x)[:-4] + '.png')  # swap the .wav extension for the spectrogram's .png
my_dict = dict(zip(df.fname, df['class']))  # map spectrogram file name -> class label
def label_func(f_name):
    # Look up the class label using the file name alone, ignoring the folder
    f_name = Path(f_name).name
    return my_dict[f_name]
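A quick sanity check: label_func only looks at the final path component, so a path under any fold folder resolves through my_dict (the file name below is hypothetical):

label_func('UrbanSoundTransforms/mel_spectrogram/1/100032-3-0-0.png')  # hypothetical file -> its class label, e.g. 'dog_bark'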
df.groupby(['fold']).classID.count().sort_values(ascending=False).plot.bar()
plt.ylabel('count')
plt.title('Files in each fold')
df.groupby('class').classID.count().sort_values(ascending=False).plot.bar()
plt.ylabel('count')
plt.title('Class distribution in the dataset')
all_folds = list(np.arange(1, 11))
all_folders = [str(i) for i in all_folds]  # folder names '1' through '10', one per fold
image_files_loc = 'UrbanSoundTransforms/mel_spectrogram/'
all_files = get_image_files(image_files_loc, recurse=True, folders=all_folders)
dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_y=label_func,
                   splitter=RandomSplitter(seed=1))
dl = dblock.dataloaders(all_files)
print('Train has {0} images and validation has {1} images.'.format(len(dl.train_ds), len(dl.valid_ds)))
learn = vision_learner(dl, resnet34, metrics=accuracy)
learn.fine_tune(3)
learn.export()
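With the learner exported to export.pkl, the model can be served as the Hugging Face demo linked above via Gradio. Here is a minimal sketch of such an app, assuming export.pkl sits next to the script and that label_func is defined where the learner is loaded; file and variable names are illustrative, not the exact Space code:

import gradio as gr
from fastai.vision.all import load_learner

learn = load_learner('export.pkl')  # label_func must be defined/importable at load time
labels = learn.dls.vocab

def predict(spectrogram_path):
    # Classify a mel spectrogram image and return per-class probabilities
    pred, pred_idx, probs = learn.predict(spectrogram_path)
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

gr.Interface(fn=predict,
             inputs=gr.Image(type='filepath'),
             outputs=gr.Label(num_top_classes=3)).launch()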