Winner’s notes. Eleftherios Spyromitros – Xioufis on Music Instruments Recognition

By Eleftherios Spyromitros Xioufis (lefman), winner of the Music Instruments track of ISMIS 2011 Contest: Music Information Retrieval.


I am currently in the first year of my PhD studies in the Department of Informatics of the Aristotle University of Thessaloniki and a member of the Machine Learning and Knowledge Discovery (MLKD) group. One of the main topics of my research is multi-label classification, the generalization of single-label (binary or multi-class) classification to domains where each instance can be associated with more than one label at the same time. The ISMIS 2011 contest on Music Information Retrieval, and particularly its music instrument recognition track, gave me the opportunity to: 1) test my data mining skills in a challenging and highly competitive environment and 2) apply my research to a new and interesting application domain. Multi-label classification fits the problem of recognizing pairs of instruments well, since it is in effect a two-label classification problem.

The given data

The given training data consisted of two significantly heterogeneous datasets: one containing single instrument examples and one containing examples of instrument pairs. The single instrument data consisted of 114914 recordings of 19 different instruments. The instrument pairs data comprised 5422 recordings of mixtures of 21 different instruments. In total there were 32 distinct instruments, just 8 of which appeared in both datasets. It is interesting to note that the pairs dataset contained instruments that can be considered kinds of instruments in the single instruments dataset (e.g. CTrumpet and B-FlatTrumpet are kinds of Trumpet). These relations complicated the learning problem. Firstly, examples of the specialized class (e.g. TenorTrombone) could be semantically considered examples of the general class (e.g. Trombone). Secondly, different kinds of the same instrument could be difficult to distinguish (e.g. is one of the instruments a soprano or an alto saxophone?). Besides the heterogeneity of the training sets, the following statements about the composition of the test set brought additional complexity to the learning task:

  • Test and training sets contain different pairs of instruments (i.e. the pairs from the training set do not occur in the test set).
  • Not all instruments from the training data must also occur in the test part.
  • There may be some instruments from the test set that only appear in the single instruments part of the training set.

To get a clearer idea of the composition of the test set, the evaluation system was queried (or tricked) for the frequency of each instrument in the test set by submitting a prediction containing the same instrument for all test instances. The results were quite revealing:

  • Only 20 out of the 32 instruments appeared in the test set.
  • The mixtures training set contained 18 of the 20 instruments of the test set plus 3 additional instruments.
  • The single instruments training set contained 9 of the 20 instruments of the test set plus 10 additional instruments.
  • There was a great discrepancy between the distribution of the labels in the training and the test data.

Exploring multi-label approaches

Preliminary experiments showed that state-of-the-art multi-label methods such as ECC[2] and RAKEL[3] had little or no advantage over the baseline Binary Relevance (BR) method. All the above methods belong to the problem transformation family of multi-label algorithms (they transform the multi-label problem into multiple binary problems and tackle it with off-the-shelf binary classifiers). BR simply learns one model for each label (instrument) by using all the examples that contain that label as positive and the rest of the examples as negative. Coupling BR with ensemble-based binary classifiers such as Random Forest[1] gave competitive results in comparison with more advanced multi-label methods. This result can be attributed to the fact that, apart from building ensembles, the main advantage of these methods is their ability to capture correlations between labels. In our case, learning the correlations which appear in the training set was not expected to be useful, since these correlations are not repeated in the test set.
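The Binary Relevance transformation described above can be illustrated with a minimal sketch. It only shows how the multi-label annotations become one binary target vector per instrument; the function name and the toy recordings are hypothetical, and any off-the-shelf binary classifier could then be trained on each vector.

```python
from typing import Dict, List, Sequence, Set


def binary_relevance_targets(label_sets: Sequence[Set[str]]) -> Dict[str, List[int]]:
    """Turn multi-label annotations into one 0/1 target vector per label.

    An example is positive for a label if its label set contains that label,
    and negative otherwise.
    """
    all_labels = sorted(set().union(*label_sets))
    return {
        label: [1 if label in labels else 0 for labels in label_sets]
        for label in all_labels
    }


# Hypothetical toy annotations: each recording lists the instruments it contains.
recordings = [{"Cello", "Oboe"}, {"Oboe", "Piano"}, {"Cello", "Piano"}]
targets = binary_relevance_targets(recordings)
print(targets["Oboe"])  # [1, 1, 0]
```

Each label is handled independently here, which is why BR cannot exploit label correlations and why that limitation did not matter in this contest.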

Engineering the input

Given the heterogeneity of the training data, an important step was to explore the best input for the learning algorithms. Initially, three different training sets were given as input: a) the union of the given training sets (both mixtures and single-instruments), b) only mixture examples, c) only single-instrument examples. An evaluation using various learning algorithms showed that using the mixtures set was better than using the single-instruments set. This was expected, since the single-instruments set had examples for only 9 of the 20 instruments which appear in the test set, whereas the mixtures set had examples for 18 of them. The unexpected result was that the only-mixtures dataset gave better results than the union of the given training sets, even though the union contained examples for all 20 instruments which appear in the test set.

A second set of experiments made things clearer. The training data corresponding to the 12 instruments which were not present in the test set were removed and the following training sets were created: a) One that contained both mixture and single-instrument examples for the instruments appearing in the test set. b) One that contained only mixture examples for 18 of the 20 instruments and single-instrument examples for the 2 remaining instruments of the test set. c) One that contained only single-instrument examples for 9 of the 20 instruments and mixture examples for the remaining 11 instruments of the test set. The best results were obtained using the second training set, which showed that learning from mixtures is better when mixtures of instruments are to be recognized. Note that adding single-instrument examples for the 2 instruments which had no examples in the mixtures set slightly improved on the performance of using only mixture examples. This showed that single-instrument data can be beneficial when no mixture data is available. The set used to train the winning method consisted of the union of the 5422 mixture examples and the 340 single-instrument examples of SynthBass and Frenchhorn. All the given feature attributes describing the mixture examples were used, while the 5 additional attributes of the single-instruments set were ignored since they were not present in the test set.

Modifying the base classifier

To deal with class imbalance (a problem arising from the use of BR for multi-label classification) we extended the original Random Forest (RF) algorithm. RF is among the most accurate general-purpose classification algorithms; however, it is susceptible to imbalanced learning situations. The idea was to combine RF with Asymmetric Bagging [4]. Instead of taking a bootstrap sample from the whole training set, bootstrapping is executed only on the examples of the majority (negative) class. The Asymmetric Bagging Random Forest (ABRF) algorithm is given below:

  1. Take a sample with replacement from the negative examples with size equal to the number of positive examples. Use all the positive examples and the negative bootstrap sample to form the new training set.
  2. Train the original RF algorithm with the desired number of trees on the new training set.
  3. Repeat the two steps above for the desired number of times. Aggregate the predictions of all the individual random trees and make the final prediction.

Building a forest of 10 random trees on each one of 10 balanced training sets yielded the best evaluation results.
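The three ABRF steps above can be sketched as follows. This is an illustrative outline rather than the code used in the contest: `train_forest` is a hypothetical callable standing in for any Random Forest trainer, and it is assumed to return a function that maps a feature vector to the fraction of its trees voting positive.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]        # (feature vector, 1 = positive, 0 = negative)
Forest = Callable[[Sequence[float]], float]  # features -> fraction of trees voting positive


def asymmetric_bags(examples: List[Example], n_bags: int,
                    rng: random.Random) -> List[List[Example]]:
    """Step 1, repeated n_bags times: keep all positives and add an equal-sized
    bootstrap sample (with replacement) drawn from the negatives only."""
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    return [
        positives + [rng.choice(negatives) for _ in positives]
        for _ in range(n_bags)
    ]


def train_abrf(examples: List[Example],
               train_forest: Callable[[List[Example]], Forest],
               n_bags: int = 10, seed: int = 0) -> List[Forest]:
    """Steps 2 and 3: train one forest per balanced bag."""
    rng = random.Random(seed)
    return [train_forest(bag) for bag in asymmetric_bags(examples, n_bags, rng)]


def abrf_confidence(forests: List[Forest], x: Sequence[float]) -> float:
    """Aggregate the votes. With the same number of trees in every forest,
    averaging the per-forest vote fractions equals pooling all trees."""
    return sum(forest(x) for forest in forests) / len(forests)
```

With `n_bags=10` and a trainer that builds 10 trees per call, this would correspond to the 10 x 10 configuration reported above.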

Informed ranking

The output produced for each label by an ABRF classifier is a confidence score of the label being true. This score is calculated by dividing the number of random trees that voted for the label by the total number of random trees. In the domain of the contest, we knew a priori that exactly two instruments are playing on each track, so we focused on producing an accurate ranking of the labels according to their relevance to each test instance and selected the two top-ranked labels. Instead of directly using the confidence scores to produce a ranking of the labels, we developed a ranking approach which takes the prior probability distribution of the labels into account. This approach is as follows:

  1. Use the trained classifiers to generate confidence scores for all test instances.
  2. Sort the list of confidence scores given for each label.
  3. Given a test instance, find its rank in the sorted list of confidences for each label. These ranks are indicative of the relevance of the instance to each label.
  4. Normalize the ranks produced in step 3 by dividing them by the estimated (based on their prior probabilities) number of relevant instances for each label in the test set and select the n labels with the lowest normalized rank.

In the context of the contest, we had the chance to use the frequencies of the labels in the validation set to estimate the number of relevant instances in the full test set. In a real-world situation, the prior probabilities of the labels in the training set could be used for this purpose.
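The four ranking steps above can be sketched as follows. The function name, the toy scores and the expected counts are hypothetical, and the sketch assumes that rank 1 is given to the instance with the highest confidence for a label.

```python
from typing import Dict, List


def informed_top_labels(scores: Dict[str, List[float]],
                        expected_counts: Dict[str, float],
                        n: int = 2) -> List[List[str]]:
    """scores[label][i] is the confidence of `label` for test instance i.
    expected_counts[label] is the estimated number of test instances
    that are relevant to `label`."""
    n_instances = len(next(iter(scores.values())))
    norm_rank: Dict[str, List[float]] = {}
    for label, conf in scores.items():
        # Steps 2 and 3: sort the instances by confidence and record each rank.
        order = sorted(range(n_instances), key=lambda i: -conf[i])
        ranks = [0] * n_instances
        for position, i in enumerate(order, start=1):
            ranks[i] = position
        # Step 4: normalize by the expected number of relevant instances.
        norm_rank[label] = [r / expected_counts[label] for r in ranks]
    # Select the n labels with the lowest normalized rank for every instance.
    return [
        sorted(scores, key=lambda lab: norm_rank[lab][i])[:n]
        for i in range(n_instances)
    ]


toy_scores = {
    "Cello": [0.9, 0.8, 0.7, 0.1],
    "Oboe": [0.6, 0.2, 0.5, 0.4],
    "Piano": [0.3, 0.7, 0.1, 0.9],
}
toy_counts = {"Cello": 4, "Oboe": 2, "Piano": 2}
print(informed_top_labels(toy_scores, toy_counts))
# [['Cello', 'Oboe'], ['Cello', 'Piano'], ['Cello', 'Oboe'], ['Piano', 'Cello']]
```

The effect of the normalization is that a frequent label can be selected even at a mediocre rank, while a rare label needs a rank near the top of its list.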

Engineering the output

As a final step, a post-processing filter was applied which disallowed instrument pairs that were present in the training set. In such cases, the second-ranked label was substituted by the next label which would not produce a label pair of the training set when combined with the first-ranked label. This substitution was based on the assumption that the classifier is more confident for the first-ranked label. The information for this filter was given in the description of the task by the contest organizers.
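A minimal sketch of such a filter is shown below. The names are hypothetical, and the fallback used when every alternative would form a training pair is an assumption, since the text does not say how that case was handled.

```python
from typing import FrozenSet, List, Set, Tuple


def filter_training_pairs(ranked_labels: List[str],
                          training_pairs: Set[FrozenSet[str]]) -> Tuple[str, str]:
    """Keep the first-ranked label and pick the best-ranked partner that does
    not form a pair already seen in the training set."""
    first = ranked_labels[0]
    for candidate in ranked_labels[1:]:
        if frozenset((first, candidate)) not in training_pairs:
            return first, candidate
    # Assumption: if every alternative is a training pair, keep the original top two.
    return first, ranked_labels[1]


seen = {frozenset(("Cello", "Oboe"))}
print(filter_training_pairs(["Cello", "Oboe", "Piano"], seen))  # ('Cello', 'Piano')
```

Storing the pairs as frozensets makes the check independent of the order of the two instruments.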

Some conclusions

One interesting conclusion was that in multi-label learning problems, like the one of this contest, where modeling label correlations is not useful, combining simple multi-label learning techniques, such as Binary Relevance, with strong single-label learning techniques, such as Random Forest, can lead to better performance compared to state-of-the-art multi-label learning techniques. Another interesting conclusion was that it is better to use only mixture examples when pairs of instruments need to be recognized. An interesting direction for future contests would be the generalization of the task to the recognition of an arbitrary number of instruments playing together.

Software used


References

  1. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
  2. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Proceedings of ECML PKDD 2009, Bled, Slovenia, pp. 254–269 (2009)
  3. Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering (2011)
  4. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1088–1099 (2006)

A paper describing this solution in more detail will appear soon in the proceedings of ISMIS 2011.

Winners’ notes. CNSlab team on music instruments recognition

By Robert Coleman and Daniel Schoonover (CNSlab) from Cognitive NeuroSystems Lab, Department of Cognitive Science, UC Irvine, USA – 3rd in Music Instruments track of ISMIS 2011 Contest: Music Information Retrieval.

Two training datasets were provided: a larger one containing data taken from single instruments, and a smaller one with data from combinations of exactly two instruments. The two datasets contained both shared and unique labels. Overall, there were 32 literally distinct classes in the two training sets. Since the problem was one of multi-way classification, the first approach was a multi-layer perceptron (MLP). With 35 hidden neurons, the MLP was trained using Levenberg-Marquardt updating. The MLP was then used to evaluate the test set, and the top 2 activations (of the 32 output nodes) were assigned as labels to each sample point. This model performed with 38% accuracy. These results led us to believe that further investigation should be done on the test data, as MLPs should perform significantly better than the nearest neighbor approach. Also, many inconsistencies existed between the labels of the two training sets, e.g. ‘alto sax’ and ‘saxophone’.

To investigate the distribution of the test samples, 32 ‘dummy’ scripts were submitted, each containing only one instrument class for both instruments and for every test sample. The resulting classification accuracy was collected for all the classes and represented the distribution of the preliminary test samples. Additionally, it was known that the preliminary and final test sets were randomly chosen from the entire test set. Using this knowledge, the resulting distribution was used as priors on the 32 classes. Upon scrutinizing the returned test distribution, we noticed that many of the classes with similar names, e.g. ‘clarinet’ vs. ‘B-flat clarinet’, appeared as only one class in the preliminary test set. With this knowledge, the classes which did not appear at all in the preliminary test set were either deleted or had their data combined with the classes which had similar names.

During initial investigation of the training data, a traditional random forest (RF) classifier was used to test the baseline classifiability of the single instrument training dataset (details of the algorithm can be found in L. Breiman 2001). A forest of 1000 decision trees, each maximally ten splits deep, was trained. Initial performance of this classifier was very good with error > 0.9%. However, the traditional RF classifier is designed to handle discrete, scalar target values. For this problem, training on the mixed instrument data, with each datum belonging to two classes, would normally not have been feasible. However, our group devised a method to train this algorithm using both the single instrument and mixed instrument training data. We generated new training sets containing one instance of the single instrument training data plus a random sample of the mixed training data, drawn with repeats and with a non-uniform distribution that matched the prior information about the final test set gained from the dummy scripts, and labeled each repeat with one or the other of the two class labels provided by the training data. This allowed the RF algorithm to be trained in a bootstrap-like manner, seeing the same datum several times and with both labels attached to it. Out-of-bag training error was optimal for the RF at roughly 300 trees, again each maximally ten splits deep. Probability outputs for each class were obtained as the proportion of votes for that class to the total number of trees.

Initial leaderboard submissions showed that this RF classified the test data with 54.66% success overall. Next, a submission was made to the leaderboard by mirroring just the most probable RF class for each entry, e.g. “cello,cello; violin,violin;…”. This submission had a leaderboard classification success of 46.02%, informing us that the RF algorithm was correctly selecting one of the two instruments in the test data 92% of the time, and that adding the second most probable instrument correctly selected the second instrument for roughly 16% of the entries.

The final model used a voting scheme to decide on the two instrument labels for each test sample. The first label was chosen from the highest RF vote. To decide the second instrument, the two independently best performing MLPs were used together with the RF probabilities. The output activations from the MLPs and the RF were weighted by each other and by the prior distribution. After discarding the RF's selection for label one, the highest vote from this ensemble was used as the second label.

Special thanks to Dr Max Welling and Eli Bowen. All analysis was done in MATLAB, using the Neural Network and Randomforest-MATLAB toolboxes.

— Robert Coleman, Daniel Schoonover

Winners’ notes. Using Multi-Resolution Clustering for Music Genre Identification

By Amanda Schierz, Marcin Budka and Edward Apeh (domcastro, BeYou) from Bournemouth University, UK, 1st and 2nd in Music Genres track of ISMIS 2011 Contest: Music Information Retrieval.

Thanks for this competition – it was great fun. Software used: R, Weka, LibSVM, Matlab, Excel. This was the 2nd competition I had entered (the first being the SIAM biological one) and I only really entered because I had so much undergraduate marking to do!  We developed a novel approach to the problem which involved multi-resolution clustering and Error Correcting Output Coding. Our 2nd place approach involved transforming the cluster labels into feature vectors.

Method and Journey:

1. We trained on 50% of the training data using Weka and built an ensemble of a cost-sensitive random forest (number of trees 100, number of features 25), a Bayes Net and a neural network. This resulted in 77.44% on the preliminary dataset. It was very frustrating as we couldn’t improve on this. We then looked at semi-iterative relabeling schemes such as Error Correcting Output Coding (using Matlab and LibSVM). This resulted in 81.59% prediction accuracy.

2. We then decided to look at the “statistics” of number of performers, segments, genres etc. We used R to normalize the data (training and test data) and to carry out K-means clustering, k=6 for genres, k=60 for performers, k=2000 for possible songs etc. Taking each set of clusters independently didn’t give any information. However, as we had pasted the results into the same file, we noticed a distinct pattern when the cluster results were looked at together: even though no crisp clusters were identified, if a training instance was of a different genre from the rest of its cluster, then it usually belonged to a different cluster at a finer granularity. We then built lots of cluster sets for the data (multi-resolution clustering). K was set to 6, 15, 20, 60, 300, 400, 600, 800, 900, 1050, 1200, 2000, 3000, 3200, 5000 and 7000 clusters. At the finest granularity (k=7000) a majority cluster vote was taken using the training instance labels and the test set predictions, and the whole cluster was relabelled to the “heaviest” class. If a cluster could not be converged at the finest k-level then we “fell back” to a lower granularity cluster (k=5000) and so on. These new predictions were fed back to the ECOC system and the process was repeated.

3. Figure below shows the overall approach we came up with:

4. This was the winning solution and resulted in a score of 0.87507 on the final test set. For the 2nd place solution, we decided to look at using the cluster assignment labels as feature vectors. This transformed the problem from the original 171-dimensional input space into a new 16-dimensional space, where each attribute was an identifier of the cluster at one of the 16 levels. So, for example, if instance #7 has fallen into the 3rd out of 6 clusters at the first granularity level, the 10th out of 15 clusters at the second granularity level and so on, in the transformed space it would be described as a 16-dimensional vector: [3, 10, …]. Note that these attributes are now categorical, with up to 7000 distinct values at the highest granularity level. This limited the number of classifiers we could use.
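The two uses of the multi-resolution clusters described in steps 2 and 4 can be sketched as follows. The function names are hypothetical and the cluster assignments are assumed to be precomputed (for example by K-means at each k). In particular, “could not be converged” is read here as “no class holds more than `min_share` of the votes in the cluster”, which is an assumption and not something the text defines.

```python
from collections import Counter
from typing import Dict, List, Optional


def relabel_by_clusters(assignments: List[List[int]], labels: List[str],
                        min_share: float = 0.5) -> List[str]:
    """assignments[level][i] is the cluster of instance i, finest level first.
    labels holds training labels and current test predictions together."""
    new_labels: List[Optional[str]] = [None] * len(labels)
    for level in assignments:
        members: Dict[int, List[int]] = {}
        for i, cluster in enumerate(level):
            members.setdefault(cluster, []).append(i)
        for idxs in members.values():
            label, count = Counter(labels[i] for i in idxs).most_common(1)[0]
            if count / len(idxs) > min_share:      # a clear "heaviest" class
                for i in idxs:
                    if new_labels[i] is None:      # finer levels take priority
                        new_labels[i] = label
    # Instances never resolved at any level keep their original label.
    return [new if new is not None else old
            for new, old in zip(new_labels, labels)]


def cluster_feature_vectors(assignments: List[List[int]]) -> List[List[int]]:
    """One categorical vector per instance: its cluster id at every level."""
    return [list(ids) for ids in zip(*assignments)]


levels = [[0, 0, 0, 1, 1, 2],   # finer clustering
          [0, 0, 0, 1, 1, 1]]   # coarser fallback
genres = ["rock", "rock", "jazz", "jazz", "rock", "jazz"]
print(relabel_by_clusters(levels, genres))
# ['rock', 'rock', 'rock', 'jazz', 'jazz', 'jazz']
```

In the toy data, the tied cluster {3, 4} is left unresolved at the finer level and is only relabelled once the coarser level provides a clear majority.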

Our classification system consisted of:
1. Random forest of 1000 unpruned C4.5 decision trees
2. Boosted ensemble of 10 C5.0 decision trees
3. Cross-trained ensemble of 100 Naive Bayes classifiers, trained on different subsets of attributes, each time selected using the Floating Forward Feature Selection method.

We have used majority voting to combine the decisions of these 3 ensembles. After labeling the test dataset using the method described above, we have fed both training and test dataset (this time with the labels from the previous step) to the ECOC system to obtain final predictions. This resulted in 0.87270 on the final test set.

— Amanda Schierz, Marcin Budka, Edward Apeh

Winners’ notes. Brian Jones on Incremental Transductive Ridge Regression

By Brian S. Jones (wahoo) from Sandia National Laboratories, USA, 3rd in Music Genres track of ISMIS 2011 Contest: Music Information Retrieval.

I became interested in the ISMIS 2011 genres contest due to the challenge that some contestants noted in the online forum:  standard model selection via cross-validation did not work well on the problem.  Supervised learning techniques I tried, such as SVM, FDA, and Random Forest, all achieved accuracy in the 90-95% range in k-fold CV, only to result in leaderboard test set accuracy in the 70-76% range.

I interpreted this performance drop as an indication that the sample selection bias and resulting dataset shift was significant.  I tried three categories of techniques in an attempt to produce a classifier that adapted to the test set distribution: standard transductive algorithms, importance weighting, and pseudo-labeling methods.

My best entry used what I call Incremental Transductive Ridge Regression.  The procedure pseudo-labels test points progressively over multiple iterations in an attempt to gradually adapt the classifier to the test distribution.  Labeled points can also be removed or reweighted over time to increase the significance of the unlabeled points.  The objective function minimized in each iteration is the combination of a labeled loss term, a pseudo-labeled loss term, and the standard L2 ridge regularizer:

The response vector y_i for each point contains K entries, one for each genre, and is encoded in binary format where y_ik = 1 if point i has label k and 0 otherwise. Other coding schemes are possible, for example using error-correcting output codes or (K-1) orthogonal vectors. The variable y_i* is a pseudo-label vector for each unlabeled point, and L_t and U_t represent the sets of labeled and unlabeled point indices utilized in iteration t. The function f is a linear predictor with weights w, and predictions are produced by argmax f(x).
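The formula itself is not reproduced in this text. One plausible form that matches the three terms described, assuming a squared loss and assuming that C weights the pseudo-labeled term (C and λ being the hyperparameters reported with the results), would be:

```latex
\min_{w} \;
  \sum_{i \in L_t} \lVert f(x_i) - y_i \rVert^2
  \; + \; C \sum_{i \in U_t} \lVert f(x_i) - y_i^{*} \rVert^2
  \; + \; \lambda \lVert w \rVert^2
```

Under these assumptions each iteration is an ordinary weighted ridge regression, so it has a closed-form solution.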

I experimented with several techniques for growing an initially empty U_t across T iterations. The most successful approach was a stratified one, adding the most confident F_k / T predictions for every class in each round. Confidence is determined by the multiclass margin, and F_k is the expected frequency of class k based on the labeled class distribution. I kept all labeled points in L_t during the T iterations, but surprisingly found that performance increased by removing them all at the end and self-training for a few extra iterations (T_II) using just the test points.
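The stratified selection step can be sketched as follows. This is an illustration with hypothetical names, not the contest code: the margin is taken as the top score minus the runner-up, and F_k is approximated as the labeled class proportion times the number of unlabeled points.

```python
from typing import Dict, List, Sequence, Set, Tuple


def multiclass_margin(scores: Sequence[float]) -> float:
    """Confidence of a prediction: best class score minus the runner-up."""
    top, runner_up = sorted(scores, reverse=True)[:2]
    return top - runner_up


def stratified_pseudo_labels(scores: List[Sequence[float]],
                             already_labeled: Set[int],
                             class_freq: Sequence[float],
                             rounds: int) -> Dict[int, int]:
    """Pick, for every class k, roughly F_k / T of the most confident
    not-yet-pseudo-labeled points, where F_k = class_freq[k] * len(scores)."""
    candidates: Dict[int, List[Tuple[float, int]]] = {}
    for i, row in enumerate(scores):
        if i in already_labeled:
            continue
        k = max(range(len(row)), key=row.__getitem__)   # argmax f(x)
        candidates.setdefault(k, []).append((multiclass_margin(row), i))
    chosen: Dict[int, int] = {}
    for k, items in candidates.items():
        quota = round(class_freq[k] * len(scores) / rounds)
        for _, i in sorted(items, reverse=True)[:quota]:
            chosen[i] = k
    return chosen


f_x = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.45, 0.55], [0.7, 0.3], [0.1, 0.9]]
print(stratified_pseudo_labels(f_x, set(), [0.5, 0.5], rounds=3))  # {0: 0, 5: 1}
```

In a full loop, the returned points would be added to U_t, the ridge model refit, and the selection repeated for T rounds.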

In the end, I was able to achieve 82.5% leaderboard accuracy using T=10, T_II=5, C=1, λ=1. I added another 0.5% by combining several of these classifiers in a voting ensemble, where diversity was introduced by bootstrap sampling the labeled set. This increase may have been spurious, however, as it did not provide similar gains at larger ensemble sizes.

Along the way, I also experimented with semi-supervised manifold algorithms like LapRLS [1] and tried importance weighting using uLSIF [2], but found only modest gains.  Other pseudo-labeling techniques that produced around 80% accuracy for me were Large Scale Manifold Transduction [3] and Tri-training [4].

For implementation, I programmed in Python/SciPy and utilized the ‘scikits.learn’ package when experimenting with off-the-shelf classifiers. Reported results involve two pre-processing steps: duplicate entries in the data sets were removed and features were normalized to have zero mean and unit variance.

I would like to thank TunedIT, the members of Gdansk University of Technology, and any others who helped put together this challenging and fun event.

— Brian S. Jones


1. Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research.
2. Kanamori, T., Hido, S., & Sugiyama, M. (2009). A Least-squares Approach to Direct Importance Estimation. Journal of Machine Learning Research.
3. Karlen, M., Weston, J., Erkan, A., & Collobert, R. (2008). Large Scale Manifold Transduction. Proceedings of the International Conference on Machine Learning.
4. Zhou, Z.-H., & Li, M. (2005). Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering.

Ed Ramsden on his winning solution in SIAM SDM’11 Contest

By Ed Ramsden (EdR), the winner of SIAM SDM’11 Data Mining Contest.

The basis for my model was the Probabilistic Neural Network (PNN), which was originally introduced by Donald Specht in 1990. The PNN is an example-based classifier in which the ‘X’ vector for an unknown case to be classified is compared to all known-class cases used in the training set. A distance metric (typically Euclidean) is passed through a Gaussian function to estimate a ‘probability’ of a match with each training case. These individual probabilities are combined for each class in the training set, and the class with the highest composite probability (W) is selected as the most likely class for the unknown case. The evaluation and combination function used was:
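The exact function is not reproduced in this text. As an illustration only, the sketch below uses a common textbook form of the PNN, in which the Gaussian kernel values are averaged per class. The averaging and the function name are assumptions and may differ from the combination actually used.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def pnn_classify(x: Sequence[float],
                 training: List[Tuple[Sequence[float], Hashable]],
                 sigma: float) -> Tuple[Hashable, Dict[Hashable, float]]:
    """Return the most likely class and the composite score W of every class."""
    sums: Dict[Hashable, float] = {}
    counts: Dict[Hashable, int] = {}
    for features, label in training:
        # Squared Euclidean distance passed through a Gaussian of width sigma.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, features))
        sums[label] = sums.get(label, 0.0) + math.exp(-sq_dist / (2 * sigma ** 2))
        counts[label] = counts.get(label, 0) + 1
    w = {label: sums[label] / counts[label] for label in sums}
    return max(w, key=w.get), w


cases = [((0.0, 0.0), "active"), ((0.0, 1.0), "active"), ((5.0, 5.0), "inactive")]
label, w = pnn_classify((0.2, 0.3), cases, sigma=1.0)
print(label)  # active
```

Because every training case is visited for each query, the cost of one classification grows linearly with the size of the training set.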

Although a PNN can be used with little or no training, this problem posed several difficulties. The first was the high dimensionality of the input data. Because they are example based, PNN classifiers require their input ‘space’ to be reasonably well filled in order to perform well. As the number of input features is increased, one would expect their input space to become exponentially sparser. The solution to this was to employ feature selection. Another challenge for obtaining good performance is the proper selection of σ, which controls the selectivity of the classifier. If one makes σ too large, the classifier will tend to lose the ability to differentiate between different input data. On the other hand, if σ is made too small, the classifier loses the ability to generalize beyond its training set. The problems of both feature and σ selection were solved by using a guided random walk, with the objective of maximizing the Modified Youden performance on the training set. One feature of this approach is that it does not require the calculation of gradient information, only the value of the metric being maximized. To avoid severe overtraining effects, a leave-one-out scheme was used to evaluate training-set performance.
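The details of the guided random walk are not given. One simple reading, shown here purely as an assumption, is a derivative-free search that proposes a random perturbation of the current parameters and keeps it only when the objective improves. The objective in the example is a hypothetical stand-in for the leave-one-out training metric.

```python
import random
from typing import Callable, List, Sequence, Tuple


def guided_random_walk(objective: Callable[[Sequence[float]], float],
                       start: Sequence[float], step: float,
                       iterations: int, rng: random.Random) -> Tuple[List[float], float]:
    """Maximize `objective` using only its values, never its gradient."""
    best, best_score = list(start), objective(start)
    for _ in range(iterations):
        candidate = [v + rng.gauss(0.0, step) for v in best]
        score = objective(candidate)
        if score > best_score:          # keep only improving moves
            best, best_score = candidate, score
    return best, best_score


# Hypothetical stand-in objective with its maximum at sigma = 2.
toy_objective = lambda p: -(p[0] - 2.0) ** 2
best, best_score = guided_random_walk(toy_objective, [0.0], 0.5, 500, random.Random(0))
```

Feature selection could be handled in the same loop by also flipping entries of a binary feature mask, although that part is not shown here.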

Because the PNN model developed as described above only sees a small subset of available inputs, I decided to attempt to increase the performance through constructing ensembles of the PNNs, and then taking a simple vote among their outputs to decide the final classification.

As one can see from the following plot, there is substantial variation in both the training and final test Modified Youden measures for different models, with a degree of correlation between the training metric and the final test metric. This led to the idea of constructing the final voting pool out of a subset of models with superior training performance.

In the end, the submission model consisted of a vote of the best 25 out of 135 candidate PNN models (by training score) constructed using 35 features. This yielded a training score of 0.794, and a final test score of 0.689. Note that while some individual sub-models would have had very similar performance to the ensemble model, there was no obvious way of reliably identifying such high-performing sub-models a priori, so the ensemble technique allowed for the combination of a number of good (and not so good) models into a better one.
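The selection and vote can be sketched as follows, with hypothetical names and toy data. Note that with two classes and an odd pool size such as 25, the simple vote can never tie.

```python
from collections import Counter
from typing import List, Sequence


def vote_of_best(predictions: List[Sequence[int]],
                 training_scores: Sequence[float], keep: int) -> List[int]:
    """predictions[m][i] is the class predicted by model m for case i.
    Only the `keep` models with the highest training score take part in the vote."""
    best = sorted(range(len(predictions)),
                  key=lambda m: training_scores[m], reverse=True)[:keep]
    n_cases = len(predictions[0])
    return [
        Counter(predictions[m][i] for m in best).most_common(1)[0][0]
        for i in range(n_cases)
    ]


model_outputs = [[1, 0, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
model_scores = [0.80, 0.70, 0.30, 0.75]
print(vote_of_best(model_outputs, model_scores, keep=3))  # [1, 0, 0]
```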

I developed the model generation code in Visual Basic .NET, and did the final vote taking using a spreadsheet. The generation and tuning of the 135 candidate models required nearly 8 hours on a single processor core of an Intel E5300.

— Ed Ramsden

Winner’s notes. Yuchun Tang on noise deduction to improve classification accuracy in SIAM SDM’11 Contest

By Yuchun Tang (piaopiao), the runner-up in SIAM SDM’11 Contest.

QSAR data provided for SIAM SDM’11 Contest were known to be highly noisy. Around 30% of labels provided could be wrong due to experimental uncertainty, as reported by the organizers after the contest was closed. Furthermore, this contest only counted the last submission, which means it was risky to overtune the models on the known data (including training data and preliminary test data).

In my approach, a 7-fold cross validation strategy was initially adopted for modeling on the training data. Several classification algorithms were tried and the best CV results (in terms of Balanced Youden Index) were observed with the R gbm and randomForest techniques. At that point the performance for gbm was 0.659/0.664/0.640 (in the order of 7-CV/preliminary/final), and for rf it was 0.636/0.718/0.628. (Of course, I only learned the final performance after the contest was closed.) I also tried different feature selection methods but I did not see obvious improvement so I decided ...

Winners’ notes. Frank Lemke on self-organizing high-dimensional QSAR modeling from noisy data

By Frank Lemke (pat) from KnowledgeMiner Software, 3rd in SIAM SDM’11 Contest.

Safety of pharmaceutical and chemical products with respect to human health and the environment has long been a major concern for the public, regulatory bodies, and the industry, and this demand is increasing. Safety aspects start in the early design phases of drugs and chemical compounds and end formally with the official authorization by national and international regulators. For decades, animal tests have been used as the preferred accepted tool (a kind of Gold Standard, which, in fact, they are not) for testing harmful effects of chemicals on living species or the environment. Currently, in Europe alone, about 10 million animals per year are (ab)used for laboratory experiments, and a lot of time and billions of Euros are spent on these experiments. So are we, as consumers who use and value chemical products every day and everywhere in some form, safe? No, not really. About 90% of the chemicals on the market today have never been tested or have not been officially requested to be tested. There is, apparently, a simple reason: leaving aside the ethical issues of animal testing (it is estimated that an additional 10 to 50 million vertebrate animals would be required if all 150,000 registered substances had to be tested in this traditional way), it is simply not possible to run animal tests for this number of substances within reasonable time and cost constraints. Animal tests cannot do that. To solve this problem, there is a strong demand for alternative testing methods like QSAR models to help minimize and largely replace animal tests in the future.

Winners’ notes. Robert Wieczorkowski, Yasser Tabandeh and Harris Papadopoulos on SIAM SDM’11 Contest

We are pleased to publish three after-challenge reports authored by participants of the SIAM SDM’11 Data Mining Contest who achieved ex aequo the 4th best score (differing only in time of submission). We hope you’ll find the reports insightful. To view full results of the competition see the Leaderboard and Summary pages.

* * *

By Robert Wieczorkowski, Ph.D. (rwieczor).

SIAM SDM’11 Contest was the second challenge in which I participated on TunedIT. Previously I took part in the IEEE ICDM Contest (Traffic Prediction) and finished in 12th place. For me, taking part in this challenge was a form of new practical experience in using modern statistical tools.

IEEE ICDM Contest – Overview of Top Solutions, part 2

As previously announced, we are publishing further descriptions of top solutions of the IEEE ICDM Contest: TomTom Traffic Prediction for Intelligent GPS Navigation. This time you may read reports on Task 2 “Jams” and Task 3 “GPS”. Solutions to Task 1 can be found in Part 1.

If you want to ask the authors any questions, feel free to comment below the post.

Task 2 “Jams”

By Łukasz Romaszko (lukasz21212121),
the winner, from the University of Warsaw, Poland.

The algorithm focuses on computing a rank for each street (a higher rank means a greater probability of a traffic jam) and the number of streets to be jammed. The highest-ranked streets are given as the output. In particular, it uses an adaptation of the k-Nearest Neighbors algorithm. The following description of the algorithm is simplified. A scheme presenting the idea of the algorithm is given below.

Computing the street ranks and the number p of jammed streets consists of several steps. Two ordered sets are chosen from the training data set: Similarity and LengthSimilarity. Similarity is used to compute the street ranks, while LengthSimilarity is used to determine the number p. On the basis of the Similarity set and special functions, the algorithm generates an array, RankingList. In the next step, RankingList is slightly modified by taking the locations of the streets into consideration. The top p streets are given as the output.
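To make the flow of these steps concrete, here is a minimal Python sketch. It is not the author's implementation: the function names, the (weight, streets) structure of the neighbor lists, the simple weighted voting and the weighted average used for p are all assumptions made for illustration, and the location-based adjustment of RankingList is left out.

```python
from collections import defaultdict


def rank_streets(similarity_set, p):
    """Return the p highest-ranked streets.

    similarity_set: list of (weight, jammed_streets) pairs taken from the
    most similar training sequences (hypothetical structure).
    """
    ranking_list = defaultdict(float)
    for weight, jammed_streets in similarity_set:
        for street in jammed_streets:
            # Each similar sequence votes for its jammed streets
            # with its own weight (assumed voting rule).
            ranking_list[street] += weight
    # Highest rank first; ties broken by street id for determinism.
    ordered = sorted(ranking_list.items(), key=lambda kv: (-kv[1], kv[0]))
    return [street for street, _ in ordered[:p]]


def estimate_p(length_similarity_set):
    """Estimate the number p of jammed streets.

    length_similarity_set: list of (weight, number_of_jammed_streets)
    pairs (hypothetical structure); a weighted average is assumed.
    """
    total_weight = sum(w for w, _ in length_similarity_set)
    weighted_sum = sum(w * n for w, n in length_similarity_set)
    return round(weighted_sum / total_weight)
```

For example, with neighbors `[(1.0, [3, 5]), (0.5, [5, 7]), (0.25, [7])]` and p = 2, street 5 collects the largest rank (1.5), followed by street 3 (1.0), so the sketch returns those two streets.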


Let us denote by T the test data: the set of identifiers of the 5 excluded segments followed by a sequence of major roads where the first jams occurred during the initial 20 minutes of the simulation.

Generating Similarity and LengthSimilarity ordered sets

First, the algorithm compares sequences from the training data to T. Two different measures were used to compute Similarity and LengthSimilarity. Sequences are compared based on the index positions of the roads that were jammed in both sequences. Let Δ be the difference between the length of the current sequence D and the length of T, where |T| denotes the number of jammed roads in T. The measure used to generate Similarity assumes that sequences of similar length (in terms of Δ) are more reliable. It is worth emphasizing that the metric used to generate the Similarity set has the greatest influence on the result.
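The exact measure is not given in this excerpt. Purely as a hypothetical illustration of the two ingredients named above (index positions of roads jammed in both sequences, and the length difference Δ), a measure of this kind could look like the sketch below. The function name, the 1/(1 + distance) scoring, the 1/(1 + Δ) discount and the assumption that each road appears at most once per sequence are all invented for the example.

```python
def similarity(train_seq, test_seq):
    """Hypothetical similarity between a training sequence D and T.

    Both arguments are lists of road identifiers in the order in which
    the roads became jammed; each road is assumed to appear at most once.
    """
    position_in_train = {road: i for i, road in enumerate(train_seq)}
    score = 0.0
    for j, road in enumerate(test_seq):
        if road in position_in_train:
            # Roads jammed in both sequences count more when their
            # index positions are close (assumed scoring rule).
            score += 1.0 / (1 + abs(position_in_train[road] - j))
    # Sequences whose length differs a lot from T are trusted less
    # (assumed way of using the length difference).
    delta = abs(len(train_seq) - len(test_seq))
    return score / (1 + delta)
```

Under these assumptions, two identical sequences of three roads score 3.0, while sequences with no jammed road in common score 0.0.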

Sequence evaluation for LengthSimilarity was similar to that for Similarity, but took into consideration only sequences with Δ ≤ 10. The algorithm for computing values in Similarity is described in the full post.

IEEE ICDM Contest – Overview of Top Solutions, part 1

The IEEE ICDM Contest: TomTom Traffic Prediction for Intelligent GPS Navigation has come to an end. As promised, we are publishing descriptions of top solutions, provided by participants. Although the reports had to be brief, the authors not only revealed a good deal of important details about their approaches, but also kept the descriptions straightforward and concise, giving all of us an unprecedented opportunity to learn the essence of data mining know-how. This is a good supplement to the full scientific articles that will be presented during the Contest Workshop at the ICDM conference in Sydney.

Today, we publish descriptions for Task 1, “Traffic”. In the coming days we’ll make another post with Task 2 and 3 reports, so stay tuned! We thank all the authors for their contributions.

If you want to ask the authors any questions, feel free to comment below the post.

* * *

By Alexander Groznetsky (alegro), the winner. Alex is an experienced data miner who participated (under the nick orgela) in the Netflix Prize contest in its early days. This becomes pretty clear when you look at the list of algorithms he used for ICDM: they sound very Netflix-like :). To learn about the task, see the “Traffic” task description page.
