Winner’s notes. Eleftherios Spyromitros – Xioufis on Music Instruments Recognition

By Eleftherios Spyromitros Xioufis (lefman), winner of the Music Instruments track of ISMIS 2011 Contest: Music Information Retrieval.


I am currently in the 1st year of my PhD studies in the Department of Informatics of the Aristotle University of Thessaloniki and member of the Machine Learning and Knowledge Discovery (MLKD) group. One of the main topics of my research is multi-label classification which is the generalization of single-label (binary or multi-class) classification in domains where each instance can be associated with more that one label at the same time. ISMIS 2011 contest on Music Information Retrieval and particularly the music instrument recognition track, gave me the opportunity to: 1) test my data mining skills in a challenging and highly competitive environment and 2) apply my research into a new and interesting application domain. Multi-label classification seems to fit well into the problem of recognizing pairs of instruments, which is actually a two-label classification problem.

The given data

The given training data consisted of two significantly heterogeneous datasets: one containing single instrument examples and one containing examples of instrument pairs. The single instrument data consisted of 114914 recordings of 19 different instruments. The instrument pairs data comprised 5422 recordings of mixtures of 21 different instruments. In total there were 32 distinct instruments, just 8 of which appeared in both datasets. It is interesting to notice that the pairs dataset contained instruments that can be considered as kinds of instruments in the single instruments dataset (e.g. CTrumpet and B-FlatTrumpet are kinds of Trumpet). These relations complicated the learning problem. Firstly, examples of the specialized class (e.g. TenorTrombone) could be semantically considered as examples of the general class (e.g. Trombone). Secondly, different kinds of the same instrument could be difficult to distinguish (e.g. is one of the instruments a soprano or an alto saxophone?). Besides the heterogeneity of the training sets, the following statements about the synthesis of the test set brought additional complexity to the learning task:

  • Test and training sets contain different pairs of instruments (i.e. the pairs from the training set do not occur in the test set).
  • Not all instruments from the training data must also occur in the test part.
  • There may be some instruments from the test set that only appear in the single instruments part of the training set.

To get a clearer idea about the synthesis of the test set, the evaluation system was queried (or tricked) for the frequency of each instrument in the test set by submitting a prediction containing the same instrument for all test instances. The results were quite revealing:

  •  Only 20 out of the 32 instruments appeared in the test set.
  •  The mixtures training set contained 18 of the 20 instruments of the test set plus 3 additional instruments.
  •  The single instruments training set contained 9 of the 20 instruments of the test set plus 10 additional instruments.
  • There was a great discrepancy between the distribution of the labels in the training and the test data.

Exploring multi-label approaches

Preliminary experiments showed that state-of-the-art multi-label methods such as ECC[2] and RAKEL[3] had little or no advantage in comparison with the baseline Binary Relevance (BR) method. All the above methods belong to the problem transformation family of multi-label algorithms (they transform the multi-label problem into multiple binary problems and tackle it with off-the-shelf binary classifiers). BR simply learns one model for each label (instrument) by using all the examples that contain that label as positive and the rest of the examples as negative. The coupling of BR with ensemble-based binary classifiers such as Random Forest[1] gave competitive results in comparison with more advanced multi-label methods. This result can be attributed to the fact that except for creating ensembles, the main advantage of these methods are the ability to capture correlations between labels. In our case, learning the correlations which appear in the training set was not expected to be useful since these correlations are not repeated in the test set.

Engineering the input

Given the heterogeneity of the training data, an important step was to explore the best input for the learning algorithms. Initially, three different training sets were given as input: a) the union of the given training sets (both mixtures and single-instruments), b) only mixture examples, c) only single-instruments examples. An evaluation using various learning algorithms showed that using the mixtures set was better than using  the single-instruments set. This was however expected, since the single-instruments set had examples for only 9 of the 20 instruments which appear in the test set, compared to the mixtures set which had examples for 18 instruments of the test set. The unexpected result that using the only-mixtures dataset gave better results than using the union of the given training sets, although examples for all 20 instruments which appear in the test set existed in the union.

A second set of experiments made things more clear. The training data corresponding to the 12 instruments which were not present in the test set were removed and the following training sets were created: a) One that contained both mixture and single-instrument examples for the instruments appearing in the test set. b) One that contained only mixture examples for the 18 out of 20 instruments and single-instrument examples for the 2 remaining instruments of the test set. c) One that contained only single-instrument examples for the 9 out of 20 instruments and mixture examples for the rest 11 instruments of the test set. The best results were obtained using the second training set, and revealed that learning from mixtures is better when mixtures of instruments are to be recognized. Note that adding single-instrument examples for the 2 instruments which had no examples in the mixtures set, slightly improved the performance of using only examples of mixtures. This revealed that using single-instrument data can be beneficial in the case that no mixture data is available. The set used to train the winning method consisted of the union of the 5422 mixture examples and the 340 single-instrument examples of SynthBass and Frenchhorn. All the given feature attributes describing the mixture examples were used, while  the 5 additional attributes of the single-instruments set were ignored since they were not present in the test set.

Modifying the base classifier

To deal with class imbalance (a problem arising from the use of BR for multi-label classification) we extended the original Random Forest (RF) algorithm. RF has been proven to have superior accuracy among current classification algorithms, however, it is susceptible on imbalanced learning situations. The idea was to combine RF with Asymmetric Bagging [4]. Instead of taking a bootstrap sample from the whole training set, bootstrapping is executed only on the examples of the majority (negative) class. The Asymmetric Bagging Random Forest (ABRF) algorithm is given below:

  1. Take a sample with replacement from the negative examples with size equal to the number of positive examples. Use all the positive examples and the negative bootstrap sample to form the new training set.
  2. Train the original RF algorithm with the desired number of trees on the new training set.
  3. Repeat the two steps above for the desired number of times. Aggregate the predictions of all the individual random trees and make the final prediction.

Building a forest of 10 random trees on each one of 10 balanced training sets yielded the best evaluation results.

Informed ranking

The output produced for each label by an ABRF classifier is a confidence score of the label being true. This score is calculated by dividing the number of random trees that voted for the label with the total number of random trees. In the domain of the contest, we a priori knew that exactly two instruments are playing on each track, thus we  focused on producing an accurate ranking of the labels according to their relevance to each test instance and selected the two top-ranked labels. Instead of directly using the confidence scores to produce a ranking of the labels, we developed a ranking approach which takes the prior probability distribution of the labels into account. This approach is as follows:

  1. Use the trained classifiers to generate confidence scores for all test instances.
  2. Sort the list of confidence scores given for each label.
  3. Given a test instance, find its rank in the sorted list of confidences for each label. These ranks are indicative of the relevance of the instance to each label.
  4. Normalize the ranks produced from step 3 by dividing them with the estimated (based on their prior probabilities) number of relevant instances for each label in the test set and select the n labels with the lowest normalized rank.

In the context of the contest, we had the chance to use the frequencies of the labels in the validation set to estimate the number of relevant instances in the full test set. In a real-world situation, the prior probabilities of the labels in the training set could be used for this purpose.

Engineering the output

As a final step, a post-processing filter was applied which disallowed instrument pairs that were present in the training set. In such cases, the second-ranked label was substituted by the next label which would not produce a label pair of the training set when combined with the first-ranked label. This substitution was based on the assumption that the classifier is more confident for the first-ranked label. The information for this filter was given in the description of the task by the contest organizers.

Some conclusions

One interesting conclusion was that in multi-label learning problems, like the one of this contest, where modeling label correlations is not useful, combining simple multi-label learning techniques, such as Binary Relevance, with strong single-label learning techniques, such as Random Forest, can lead to better performance compared to state-of-the-art multi-label learning techniques. Another interesting conclusion was that it is better to use only mixture examples when pairs of instruments need to be recognized. An interesting direction for future contests would be the generalization of the task to the recognition of an arbitrary number of instruments playing together.

Software used


  1. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
  2. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label
    classification. In: Proceedings of ECML PKDD 2009, Bled, Slovenia, pp. 254–269
  3. Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multi-label classification.
    IEEE Transactions on Knowledge and Data Engineering (2011)
  4. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric bagging and random subspace for
    support vector machines-based relevance feedback in image retrieval. IEEE Transactions
    on Pattern Analysis and Machine Intelligence 28, 1088–1099 (2006)

A paper describing this solution in more details will appear soon in the proceedings of ISMIS 2011.

%d bloggers like this: