Music Genre Classification using Hidden Markov Models

Music genre classification has been an interesting problem in the field of Music Information Retrieval (MIR). In this tutorial, We will try to classify music genre using hidden Markov models which are very good at modeling time series data. As Music audio files are time series signals, we expect that HMMs will suit our needs and give us an accurate classification. An HMM is a model that represents probability distributions over sequences of observations. We assume that the outputs are generated by hidden states. To learn more about HMM click click here. Also, you can find a good explanation for HHMs in the time series use case here.

Dataset & Features

We will use a dataset provided by Marsyas (Music Analysis, Retrieval, and Synthesis for Audio Signals) which is an open source software, called GTZAN. It’s a collection of 1000 audio tracks each 30 seconds long. There are 10 genres represented, each containing 100 tracks. All the tracks are 22050Hz Mono 16-bit audio files in .au format. In our tutorial, we will use all provided genres (blues, classical, jazz, country, pop, rock, metal, disco, hip-hop, reggae). For music genre classification, we will be easier for us to use WAV files, because they can be easily read by the scipy library. We would, therefore, have to convert our AU files to WAV format. The dataset can be accessed here.

For audio processing, we needed to find a way to concisely represent song waveforms. Mel Frequency Cepstral Coefficients (MFCC) is a good way to do this. MFCC takes the power spectrum of a signal and then uses a combination of filter banks and discrete cosine transform to extract features. More details about MFCC can be found here.

Let’s get started by importing the necessary libraries for our project.

    from python_speech_features import mfcc, logfbank
    from import wavfile
    import numpy as np
    import matplotlib.pyplot as plt
    from hmmlearn import hmm
    from sklearn.metrics import confusion_matrix
    import itertools
    import os

Let’s pick one song from our dataset and extract MFCC and Filter banks features.

    sampling_freq, audio ="genres/blues/blues.00000.wav")
    mfcc_features = mfcc(audio, sampling_freq)
    filterbank_features = logfbank(audio, sampling_freq)

We take a look at the shapes of the extracted features.

print ('\nMFCC:\nNumber of windows =', mfcc_features.shape[0])
print ('Length of each feature =', mfcc_features.shape[1])
print ('\nFilter bank:\nNumber of windows =', filterbank_features.shape[0])
print ('Length of each feature =', filterbank_features.shape[1])

MFCC: Number of windows = 1496 Length of each feature = 13 Filter bank: Number of windows = 1496 Length of each feature = 26

Now, let’s go through some samples of our dataset. We loop over the genres folders and visualize the MFCC features for the first song in each folder.

import glob
import os.path as path
genre_list = [blues,classical, jazz, country, pop, rock, metal, disco, hiphop, reggae]
figure = plt.figure(figsize=(20,3))
for idx ,genre in enumerate(genre_list): 
   example_data_path = genres/ + genre
   file_paths = glob.glob(path.join(example_data_path, *.wav))
   sampling_freq, audio =[0])
   mfcc_features = mfcc(audio, sampling_freq, nfft=1024)
   print(file_paths[0], mfcc_features.shape[0])
   plt.text(150, -10, genre, horizontalalignment=center, fontsize=20)

songs mfcc

Building Hidden Markov Models

We built a class to handle HMM training and prediction by wrapping the model provided by hmmlearn library.

    class HMMTrainer(object):
    def __init__(self, model_name='GaussianHMM', n_components=4, cov_type='diag', n_iter=1000):
        self.model_name = model_name
        self.n_components = n_components
        self.cov_type = cov_type
        self.n_iter = n_iter
        self.models = []
        if self.model_name == 'GaussianHMM':
        self.model = hmm.GaussianHMM(n_components=self.n_components,        covariance_type=self.cov_type,n_iter=self.n_iter)
        raise TypeError('Invalid model type') 

    def train(self, X):
        # Run the model on input data
    def get_score(self, input_data):
        return self.model.score(input_data)

Training & evaluating the Hidden Markov Models

For training the Hidden Markov Models, We loop over the subfolder in our dataset, we iterate over the songs in the subfolder in order to extract the features and append it to a variable. We should store all trained HMM models, so we will be able to predict the class of unseen songs. As HMM is a generative model for unsupervised learning, we don’t need labels to build HMM models for each class. We explicitly assume that separate HMM models will be built for each class. Note that we have used 4 as the number of components which is exactly the number of hidden state in the HMM models. Finding out the best number of states is about testing and experimenting with different values and pick the one that optimizes the predictions.

    hmm_models = []
    input_folder = 'genres/'
    # Parse the input directory
    for dirname in os.listdir(input_folder):
        # Get the name of the subfolder
        subfolder = os.path.join(input_folder, dirname)
        if not os.path.isdir(subfolder):
        # Extract the label
        label = subfolder[subfolder.rfind('/') + 1:]
        # Initialize variables
        X = np.array([])
        y_words = []
        # Iterate through the audio files (leaving 1 file for testing in each class)
        for filename in [x for x in os.listdir(subfolder) if x.endswith('.wav')][:-1]:
                # Read the input file
                filepath = os.path.join(subfolder, filename)
                sampling_freq, audio =
                # Extract MFCC features
                mfcc_features = mfcc(audio, sampling_freq)
                # Append to the variable X
                if len(X) == 0:
                    X = mfcc_features
                    X = np.append(X, mfcc_features, axis=0)

                # Append the label
        print('X.shape =', X.shape)
        # Train and save HMM model
        hmm_trainer = HMMTrainer(n_components=10)
        hmm_models.append((hmm_trainer, label))
        hmm_trainer = None

Now it’s time to evaluate our models, We iterate over the test dataset subfolders, we extract the features then we iterate through all HMM models and pick the one with the highest score.

input_folder = 'test/'
real_labels = []
pred_labels = []
for dirname in os.listdir(input_folder):

  subfolder = os.path.join(input_folder, dirname)
  if not os.path.isdir(subfolder):
  # Extract the label
  label_real = subfolder[subfolder.rfind('/') + 1:]

  for filename in [x for x in os.listdir(subfolder) if x.endswith('.wav')][:-1]:
    filepath = os.path.join(subfolder, filename)
    sampling_freq, audio =
    mfcc_features = mfcc(audio, sampling_freq)
    max_score = -9999999999999999999
    output_label = None
    for item in hmm_models:
       hmm_model, label = item
       score = hmm_model.get_score(mfcc_features)
       if score > max_score:
          max_score = score
          output_label = label

Till now, We have the real label for each song in our test dataset and we have estimated the predicted ones. As our problem is a multiclass classification, the best way to evaluate our model’s performance is to look at the confusion matrix (read more here). We use the confusion matrix provided by sklearn and we visualize it using matplotlib library. First, let’s define a function to take care of the matric plotting. This function prints and plots the confusion matrix. Normalization can be applied by setting normalize=True.

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
        print('Confusion matrix, without normalization')


    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Time to compute the confusion matrix and visualize it!

    cm = confusion_matrix(real, pred)
    classes = ["blues","classical", "country", "disco", "hiphop", "jazz", "metal", "pop", "reggae", "rock"]
    plot_confusion_matrix(cm, classes=classes, normalize=True,
                        title='Normalized confusion matrix')

confusion matrix

For a perfect classifier, we would have expected a diagonal of dark squares from the left-upper corner to the right lower one, and light colors for the remaining area. As we can see from the matrix plot, The HMM model for the ‘Pop’ class is able to perfectly predict the class of the songs. ‘Country’ and ‘Hip-hop’ are also performing good. However, the worst model is the one for ‘Rock’ class! sorry, rock-fans! For an even better evaluation, It’s always recommended to use the precision and recall metrics for the classification problem. sklearn made it easy for us to have a detailed report about the precision and recall of our multiclass classification. We just need to provide the real and predicted values and our classes names.

    from sklearn.metrics import classification_report
    print(classification_report(real, pred, target_names=classes))

The output looks like:

precision    recall  f1-score   support

      blues       0.50      0.44      0.47         9
  classical       1.00      0.67      0.80         9
    country       0.50      0.89      0.64         9
      disco       0.50      0.44      0.47         9
     hiphop       0.70      0.78      0.74         9
       jazz       0.60      0.67      0.63         9
      metal       0.67      0.67      0.67         9
        pop       0.75      1.00      0.86         9
     reggae       0.80      0.44      0.57         9
       rock       0.33      0.22      0.27         9

avg / total       0.64      0.62      0.61        90


That brings us to the end of this tutorial. The performance of our Hidden Markov models is relatively average, and obviously, there is a huge room for improvement. Indeed, the Hidden Markov Models are interpretable and powerful for time series data, however, they require a lot of fine-tuning (number of hidden states, input features …). Next, It would be good to try other approaches for classifying music genres such as a recurrent neural network (LSTM).

Written on May 1, 2019