Final Project
CS 429: Computer Vision
Matt Hoffman
Album Cover-Based Genre Classification
Introduction:
Automatic genre classification is a popular problem in the machine listening community. The ability to classify a song or album into a broad category without human interaction has numerous applications for the recording and music retail industries. Until now, most work on this problem has focused on extracting relevant features from audio data, or occasionally other metadata (musical notation, user playlists, etc.) which is frequently hand-labeled. This project attempts to make use of an additional source of data: the cover artwork associated with albums.
Previous Work:
Although significant amounts of work have been done on applying audio features to genre recognition, these features are obviously unavailable here. Instead, my project uses image features of the sort commonly used as low-level input to computer vision systems.
Data:
I acquired my data set from the AllMusic Guide (http://www.allmusic.com), which has an extensive database of artists and albums browsable by, among other things, genre. Images of album covers are stored as 200 pixel by 200 pixel JPEG images. I wrote a program in Java to automatically scrape a collection of these images (sorted by genre) from the allmusic.com website. This collection was then culled to 59 albums per genre for training and testing. The genres my system considers are Country, Easy Listening, Electronica, Hard Rock, Jazz, New Age, and Rap. This selection of genres was chosen in the hope that the album covers within each genre would be easily distinguished from those of other genres. Classifying an album as Folk versus Country, for example, might be difficult even for a human being based on album cover alone.
Below are links to tiled images of the album covers used in this project, organized by genre:
Country
Easy Listening
Electronica
Hard Rock
Jazz
New Age
Rap
Features:
The features I used fall into four categories: those based on color histograms, those derived from the frequency domain, those based on the Canny edge detection algorithm, and those based on a corner detection algorithm.
Color Histograms:
These are obtained for each image by calculating a 50-bin histogram of the hue, saturation, and brightness values of every pixel in the image. This results in 50 * 3 = 150 features describing the hue, saturation, and brightness content of the image.
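The histogram step can be sketched as follows. This is a minimal illustration rather than the code used in the project: the function name hsv_histograms is hypothetical, the input is assumed to be a flat sequence of 8-bit RGB pixels, and raw counts are returned because the text does not say whether the histograms were normalised.

```python
import colorsys

def hsv_histograms(pixels, bins=50):
    """Return hue, saturation and brightness histograms as one flat list.

    `pixels` is an iterable of (r, g, b) tuples with channels in 0..255.
    The result has 3 * bins entries: hue bins, then saturation, then brightness.
    """
    hists = [[0] * bins for _ in range(3)]
    for r, g, b in pixels:
        hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        for channel, value in enumerate(hsv):
            # Each component lies in [0, 1]; a value of exactly 1.0 goes in the last bin.
            index = min(int(value * bins), bins - 1)
            hists[channel][index] += 1
    # Raw counts are kept (an assumption; the report does not mention normalisation).
    return [count for hist in hists for count in hist]
```

With the default of 50 bins, the returned list has the 150 entries described above.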
Cepstral Coefficients:
The next set of features is obtained from the frequency-domain representation of each image, computed as the magnitude spectrum of the image under a Discrete Cosine Transform (DCT). To obtain a coarser representation of the frequency content, I take the DCT of the log-magnitude spectrum, which yields a cepstral representation of the image. I then discard all but the lowest 10x10 square of coefficients, which describe the low-frequency activity in the spectrum and should be the most important. This adds another 10 * 10 = 100 features to describe the frequency content of the image.
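A rough sketch of this two-stage transform is shown below. Several details are assumptions not stated in the text: the input is taken to be a single-channel (grayscale) image given as a list of rows, the DCT is the unnormalised DCT-II, and a small eps is added before the logarithm to avoid log(0). The function names are illustrative only.

```python
import math

def dct_1d(values):
    """Unnormalised DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def dct_2d(matrix):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    # Transpose back so the result is indexed [vertical freq][horizontal freq].
    return [list(row) for row in zip(*cols)]

def cepstral_features(gray, keep=10, eps=1e-9):
    """DCT of the log-magnitude DCT spectrum, truncated to keep x keep coefficients."""
    spectrum = dct_2d(gray)
    # eps is an assumption: it only guards against taking the log of zero.
    log_magnitude = [[math.log(abs(c) + eps) for c in row] for row in spectrum]
    cepstrum = dct_2d(log_magnitude)
    return [cepstrum[u][v] for u in range(keep) for v in range(keep)]
```

This direct implementation is slow for a 200x200 image; a fast DCT routine from a numerical library would be the practical choice, and the structure of the computation would stay the same.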
Edge Features:
To capture information about the edge content of images, I apply the Canny edge detection algorithm to each album cover and take statistics about the resulting image. The first statistic I calculate is simply the number of pixels determined to be edges using nonmaximum suppression and hysteresis thresholding with a sigma of 3 pixels, a horizontal threshold of 0.3, and a vertical threshold of 0.3. To get finer-grained information about edge distribution, I then break the image into a grid of 25x25-pixel sections and calculate the standard deviation, minimum, and maximum of the number of edge pixels across all the sections. To capture how edge content around the border of the cover compares with the center, I also separately calculate the mean and standard deviation of the number of edge pixels for the 1/3 of grid sections closest to the edge of the image and for the remaining sections. These operations produce an additional 8 features.
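The grid statistics can be illustrated with the sketch below, which takes an already-computed binary edge map (the Canny step itself is not shown). Two choices here are assumptions: the "border" sections are approximated by the outermost ring of grid cells, which is only one possible reading of the 1/3 rule, and the population standard deviation is used. The function name is hypothetical.

```python
import statistics

def grid_count_features(mask, cell=25):
    """Summarise a binary mask (list of rows of 0/1 values) as 8 numbers.

    Order: total count; std dev, min, max of per-cell counts;
    mean and std dev for border cells; mean and std dev for center cells.
    The grid must be at least 3 cells wide and tall so the center is non-empty.
    """
    rows, cols = len(mask) // cell, len(mask[0]) // cell
    counts = [
        [
            sum(mask[y][x]
                for y in range(r * cell, (r + 1) * cell)
                for x in range(c * cell, (c + 1) * cell))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

    def on_border(r, c):
        # Assumption: border = outermost ring of cells in the grid.
        return r in (0, rows - 1) or c in (0, cols - 1)

    flat = [n for row in counts for n in row]
    border = [counts[r][c] for r in range(rows) for c in range(cols) if on_border(r, c)]
    center = [counts[r][c] for r in range(rows) for c in range(cols) if not on_border(r, c)]
    return [
        sum(flat),
        statistics.pstdev(flat), min(flat), max(flat),
        statistics.mean(border), statistics.pstdev(border),
        statistics.mean(center), statistics.pstdev(center),
    ]
```

For a 200x200 cover and 25-pixel cells this gives an 8x8 grid. Because the function only needs a binary mask, the same routine would apply unchanged to the output of a corner detector.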
Corner Features:
These features closely mirror the edge features, except that they are computed from the output of a corner detection algorithm instead of an edge detection algorithm. From the pixels marked as corners (within a neighborhood of 3 pixels, with a sigma of 3 pixels, and a threshold of 50), I calculate the same statistics as above: the total number of corners; the standard deviation, minimum, and maximum of the number of corners in the 25x25-pixel sections; and the means and standard deviations of the numbers of corners in sections around the border and in the middle. This produces another 8 features.
In total, I extract 150 + 100 + 8 + 8 = 266 features for each image.
Classification:
I use a Support Vector Machine (SVM) classifier with a Radial Basis Function (RBF) kernel to do the actual classification. The implementation is from the WEKA Java library for machine learning, which implements a number of standard classifiers. I tried several other classifiers, including naive Bayesian networks, AdaBoost with decision stumps, and k-nearest neighbors, but SVMs produced the best results.
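For reference, the RBF kernel compares two feature vectors through their squared Euclidean distance. The sketch below shows the standard definition only; it is not WEKA's implementation, and the gamma value is a placeholder because the report does not state the kernel parameters that were used.

```python
import math

def rbf_kernel(x, y, gamma=0.01):
    """Standard RBF kernel: exp(-gamma * ||x - y||^2).

    gamma is a hypothetical default; the report does not give the value used.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the two feature vectors move apart.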
Results - Feature Data:
HSV Histograms:
The mean histograms across all albums for hue, saturation, and brightness can be found here:
Hue
Saturation
Brightness
The deviations from these means for the mean histogram values of each of the seven genres can be found here:
Country:
Hue
Saturation
Brightness
Easy Listening:
Hue
Saturation
Brightness
Electronica:
Hue
Saturation
Brightness
Hard Rock:
Hue
Saturation
Brightness
Jazz:
Hue
Saturation
Brightness
New Age:
Hue
Saturation
Brightness
Rap:
Hue
Saturation
Brightness
Composite graphs of deviations from the mean for all genres can be found here:
Hue
Saturation
Brightness
Cepstral Coefficients:
A plot of the mean of the cepstral coefficients for all images can be found here.
Plots of the deviations from the mean of the average cepstral coefficients for each genre can be found below:
Country
Easy Listening
Electronica
Hard Rock
Jazz
New Age
Rap
Edge Features:
Deviations from the mean for the mean edge feature values of each of the seven genres are summarized in this plot. The features are presented in the order described above: sum of all edge pixels; std dev, min, and max of edge pixels in each section; and mean and std dev of edge pixels in each section in the border and center of the image.
Corner Features:
Deviations from the mean for the mean corner feature values of each of the seven genres are summarized in this plot. These features are presented in the same order described above.
Results - Classification:
This graph summarizes the results of my system. The y-axis represents the percentage of albums that are classified correctly as belonging to their actual genre within x (the corresponding value on the x-axis) tries. Each line represents a different subset of the available features. So, for example, when using all of the features available, the correct genre is within the classifier's top three guesses in over 60% of cases, whereas random guessing (the baseline) would only have had the correct answer in its top three a little more than 40% of the time. To test my system I used repeated 10-fold cross-validation, dividing the data set randomly into 10 equal-sized subsets, testing the classifier with each and training with the rest. I repeated this process many times until the average results showed little change.
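The evaluation described here can be sketched as below: a top-x accuracy measure, the random-guessing baseline of x/7, and a random split into 10 folds. The function names and the fixed seed are illustrative assumptions, and the classifier that produces the ranked guesses is not shown.

```python
import random

def top_k_accuracy(ranked_guesses, truths, k):
    """Fraction of items whose true label is among the first k ranked guesses."""
    hits = sum(truth in guesses[:k] for guesses, truth in zip(ranked_guesses, truths))
    return hits / len(truths)

def random_baseline(k, n_classes=7):
    """Chance that a random ordering of the classes has the truth in its top k."""
    return k / n_classes

def k_fold_indices(n_items, n_folds=10, seed=0):
    """Yield (train, test) index lists for one round of k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding gives folds whose sizes differ by at most one item.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

With 7 genres, random_baseline(3) is 3/7, or about 42.9%, which matches the "little more than 40%" baseline quoted above. Repeating the procedure with different seeds corresponds to the repeated cross-validation described in the text.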
The system successfully guesses the genre on the first try only about 26.6% of the time when using all of the available features, but this is substantially better than the baseline result of 1/7, or 14.3%. That the system does not guess the correct genre a majority of the time should not be too surprising, since it does not have access to the semantic information that human beings do, and even with such information it is entirely possible for humans to misclassify an album's genre if only the cover art is available. Examples of such semantic information include titles and artist names ("25 Country Hits" gives a fairly good indication of the genre of an album, for example) or images of artists (rock musicians tend to look recognizably different from country musicians, but accurately distinguishing between them may require the cognitive ability to analyze fine visual details).
Each of the sets of features contributed something to the accuracy of the system, but the returns diminished substantially as more information was added. In the case of adding corner information to edge information, no improvement becomes visible until lower-ranked choices are considered. The most important features seem to be those associated with cepstral information and brightness histograms. It is possible that better results could be had by taking further advantage of information in the frequency domain, or by using some kind of template-based approach of the sort used in scene categorization, to which this problem bears a substantial resemblance.
Below is a confusion matrix detailing what sorts of mistakes the program makes when it misclassifies albums. Names on the left indicate the correct genre, while the names at the top are those with which an album was confused. For example, 61 new age albums were identified as electronica albums. Interestingly, many albums of all genres were misclassified as electronica. This is perhaps due to the frequently simple color schemes and designs of albums in this genre. Jazz was frequently mistaken for country, which may be accounted for by the overlap in the time periods during which the albums under consideration were recorded. It is worth noting that release date probably plays as important a role in classification as genre, since design styles change over time, and the albums AllMusic Guide designates as "classic" for a genre will tend to clump together in time. So it is possible that the system is seizing on information that could better be correlated with release date than genre proper.
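A confusion matrix with this layout (correct genre per row, guessed genre per column) can be tallied as in the following sketch. The function name is hypothetical and the labels in the usage note are illustrative, not the project's results.

```python
from collections import Counter

def confusion_matrix(truths, predictions, labels):
    """Return a matrix where entry [i][j] counts items of true class labels[i]
    that were classified as labels[j]."""
    counts = Counter(zip(truths, predictions))
    # Counter returns 0 for (truth, prediction) pairs that never occurred.
    return [[counts[(t, p)] for p in labels] for t in labels]
```

For example, confusion_matrix(["Jazz", "Jazz", "Rap"], ["Country", "Jazz", "Rap"], ["Country", "Jazz", "Rap"]) places one count in the Jazz row under the Country column, which is the kind of entry the Jazz-versus-country observation above refers to.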
Conclusion:
This system did a marginally successful job of classifying images of album covers into genres. Its performance, though significantly better than baseline, was nonetheless not stellar. It is possible that more sophisticated shape matching techniques could be of use in improving these results, since no attempt to use such information was made. Although the performance of this classifier was not really adequate for use in applications on its own, it is possible that when combined with audio features it could be of some use in slightly improving results, since there is no reason to expect any of the features used in this system to show strong correlations with those