**Profile**

I am currently in the 1st year of my PhD studies in the Department of Informatics of the Aristotle University of Thessaloniki and a member of the Machine Learning and Knowledge Discovery (MLKD) group. One of the main topics of my research is multi-label classification, which is the generalization of single-label (binary or multi-class) classification to domains where each instance can be associated with more than one label at the same time. The ISMIS 2011 contest on Music Information Retrieval, and particularly the music instrument recognition track, gave me the opportunity to: 1) test my data mining skills in a challenging and highly competitive environment and 2) apply my research to a new and interesting application domain. Multi-label classification fits the problem of recognizing pairs of instruments well, since it is actually a two-label classification problem.

**The given data**

The given training data consisted of two significantly heterogeneous datasets: one containing single instrument examples and one containing examples of instrument pairs. The single instrument data consisted of 114914 recordings of 19 different instruments. The instrument pairs data comprised 5422 recordings of mixtures of 21 different instruments. In total there were 32 distinct instruments, just 8 of which appeared in both datasets. It is worth noting that the pairs dataset contained instruments that can be considered kinds of instruments in the single instruments dataset (e.g. CTrumpet and B-FlatTrumpet are kinds of Trumpet). These relations complicated the learning problem. Firstly, examples of the specialized class (e.g. TenorTrombone) could be semantically considered as examples of the general class (e.g. Trombone). Secondly, different kinds of the same instrument could be difficult to distinguish (e.g. is one of the instruments a soprano or an alto saxophone?). Besides the heterogeneity of the training sets, the following statements about the composition of the test set brought additional complexity to the learning task:

- Test and training sets contain different pairs of instruments (i.e. the pairs from the training set do not occur in the test set).
- Not all instruments from the training data must also occur in the test part.
- There may be some instruments from the test set that only appear in the single instruments part of the training set.

To get a clearer idea about the composition of the test set, the evaluation system was queried (or tricked) for the frequency of each instrument in the test set by submitting a prediction containing the same instrument for all test instances. The results were quite revealing:

- Only 20 out of the 32 instruments appeared in the test set.
- The mixtures training set contained 18 of the 20 instruments of the test set plus 3 additional instruments.
- The single instruments training set contained 9 of the 20 instruments of the test set plus 10 additional instruments.
- There was a great discrepancy between the distribution of the labels in the training and the test data.

**Exploring multi-label approaches**

Preliminary experiments showed that state-of-the-art multi-label methods such as ECC [2] and RAKEL [3] had little or no advantage over the baseline Binary Relevance (BR) method. All of these methods belong to the problem transformation family of multi-label algorithms (they transform the multi-label problem into multiple binary problems and tackle it with off-the-shelf binary classifiers). BR simply learns one model for each label (instrument), using all the examples that contain that label as positives and the rest of the examples as negatives. Coupling BR with ensemble-based binary classifiers such as Random Forest [1] gave competitive results in comparison with the more advanced multi-label methods. This result can be attributed to the fact that, apart from building ensembles, the main advantage of these methods is their ability to capture correlations between labels. In our case, learning the correlations that appear in the training set was not expected to be useful, since these correlations are not repeated in the test set.
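As a minimal sketch of the BR transformation described above (not the code used in the contest), the function below turns a multi-label dataset into one binary dataset per label. The function name, the toy features, and the instrument names are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Set, Tuple


def binary_relevance_datasets(
    features: Sequence[Sequence[float]],
    label_sets: Sequence[Set[str]],
) -> Dict[str, List[Tuple[Sequence[float], int]]]:
    """Build one binary training set per label (1 = label present, 0 = absent)."""
    all_labels = sorted(set().union(*label_sets))
    return {
        label: [(x, int(label in labels)) for x, labels in zip(features, label_sets)]
        for label in all_labels
    }


# Toy data (hypothetical): three recordings, two features each, two instruments per recording.
X = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.5]]
Y = [{"Violin", "Piano"}, {"Piano", "Oboe"}, {"Violin", "Oboe"}]
datasets = binary_relevance_datasets(X, Y)
```

Each entry of `datasets` can then be handed to any off-the-shelf binary classifier, one model per instrument.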

**Engineering the input**

Given the heterogeneity of the training data, an important step was to explore the best input for the learning algorithms. Initially, three different training sets were given as input: a) the union of the given training sets (both mixtures and single-instruments), b) only mixture examples, c) only single-instrument examples. An evaluation using various learning algorithms showed that using the mixtures set was better than using the single-instruments set. This was expected, however, since the single-instruments set had examples for only 9 of the 20 instruments that appear in the test set, whereas the mixtures set had examples for 18 of them. The unexpected result was that the only-mixtures dataset gave better results than the union of the given training sets, even though the union contained examples for all 20 instruments that appear in the test set.

A second set of experiments made things clearer. The training data corresponding to the 12 instruments that were not present in the test set were removed and the following training sets were created: a) One that contained both mixture and single-instrument examples for the instruments appearing in the test set. b) One that contained only mixture examples for 18 of the 20 instruments and single-instrument examples for the 2 remaining instruments of the test set. c) One that contained only single-instrument examples for 9 of the 20 instruments and mixture examples for the remaining 11 instruments of the test set. The best results were obtained using the second training set, which showed that learning from mixtures is better when mixtures of instruments are to be recognized. Note that adding single-instrument examples for the 2 instruments that had no examples in the mixtures set slightly improved performance over using only examples of mixtures. This showed that single-instrument data can be beneficial when no mixture data is available. The set used to train the winning method consisted of the union of the 5422 mixture examples and the 340 single-instrument examples of SynthBass and Frenchhorn. All the given feature attributes describing the mixture examples were used, while the 5 additional attributes of the single-instruments set were ignored since they were not present in the test set.

**Modifying the base classifier**

To deal with class imbalance (a problem arising from the use of BR for multi-label classification) we extended the original Random Forest (RF) algorithm. RF has been shown to be highly accurate compared with other current classification algorithms; however, it is susceptible to class imbalance. The idea was to combine RF with Asymmetric Bagging [4]: instead of taking a bootstrap sample from the whole training set, bootstrapping is executed only on the examples of the majority (negative) class. The Asymmetric Bagging Random Forest (ABRF) algorithm is given below:

- Take a sample with replacement from the negative examples with size equal to the number of positive examples. Use all the positive examples and the negative bootstrap sample to form the new training set.
- Train the original RF algorithm with the desired number of trees on the new training set.
- Repeat the two steps above for the desired number of times. Aggregate the predictions of all the individual random trees and make the final prediction.

Building a forest of 10 random trees on each one of 10 balanced training sets yielded the best evaluation results.
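The sampling and aggregation steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the contest implementation: the base learner is passed in as a callable, and `toy_forest` is a hypothetical placeholder (identical threshold stumps, not a Random Forest) used only so that the example runs on its own.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]      # (feature vector, 0/1 label)
Tree = Callable[[Sequence[float]], int]    # a trained tree returns a 0/1 vote


def asymmetric_bags(data: Sequence[Example], n_bags: int,
                    rng: random.Random) -> List[List[Example]]:
    """Each bag holds all positives plus an equally sized bootstrap of negatives."""
    positives = [ex for ex in data if ex[1] == 1]
    negatives = [ex for ex in data if ex[1] == 0]
    return [positives + rng.choices(negatives, k=len(positives))  # with replacement
            for _ in range(n_bags)]


def abrf_fit(data: Sequence[Example],
             train_forest: Callable[[List[Example]], List[Tree]],
             n_bags: int = 10, seed: int = 0) -> List[Tree]:
    """Train one forest per balanced bag and pool all of their trees."""
    rng = random.Random(seed)
    trees: List[Tree] = []
    for bag in asymmetric_bags(data, n_bags, rng):
        trees.extend(train_forest(bag))
    return trees


def abrf_confidence(trees: List[Tree], x: Sequence[float]) -> float:
    """Fraction of pooled trees that vote for the positive class."""
    return sum(tree(x) for tree in trees) / len(trees)


def toy_forest(bag: List[Example], n_trees: int = 10) -> List[Tree]:
    """Placeholder learner (NOT a Random Forest): n_trees identical stumps."""
    pos = [x[0] for x, y in bag if y == 1]
    neg = [x[0] for x, y in bag if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return [lambda x, t=threshold: int(x[0] > t)] * n_trees


# Imbalanced toy data: 3 positives, 12 negatives, one feature each.
data = [([0.8], 1), ([0.9], 1), ([1.0], 1)] + [([i / 40], 0) for i in range(12)]
trees = abrf_fit(data, toy_forest, n_bags=10)
```

With 10 bags and 10 trees per bag the pooled ensemble has 100 voters, matching the configuration reported above; swapping `toy_forest` for a real Random Forest trainer is left to the reader.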

**Informed ranking**

The output produced for each label by an ABRF classifier is a confidence score of the label being true. This score is calculated by dividing the number of random trees that voted for the label by the total number of random trees. In the domain of the contest, we knew a priori that exactly two instruments are playing on each track, so we focused on producing an accurate ranking of the labels according to their relevance to each test instance and selected the two top-ranked labels. Instead of directly using the confidence scores to produce a ranking of the labels, we developed a ranking approach that takes the prior probability distribution of the labels into account. This approach is as follows:

- Use the trained classifiers to generate confidence scores for all test instances.
- Sort the list of confidence scores given for each label.
- Given a test instance, find its rank in the sorted list of confidences for each label. These ranks are indicative of the relevance of the instance to each label.
- Normalize the ranks produced in the previous step by dividing them by the estimated (based on their prior probabilities) number of relevant instances for each label in the test set and select the n labels with the lowest normalized rank.

In the context of the contest, we had the chance to use the frequencies of the labels in the validation set to estimate the number of relevant instances in the full test set. In a real-world situation, the prior probabilities of the labels in the training set could be used for this purpose.
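The ranking procedure above can be sketched in a few lines. The function name, the toy scores, and the priors below are assumptions for illustration; ranks are assigned so that rank 1 is the highest confidence, ties are broken by instance order, and every label is assumed to have a prior greater than zero.

```python
from typing import Dict, List


def informed_ranking(scores: Dict[str, List[float]],
                     priors: Dict[str, float],
                     n_labels: int = 2) -> List[List[str]]:
    """Pick, per instance, the n_labels with the lowest prior-normalized rank."""
    n_instances = len(next(iter(scores.values())))
    normalized: Dict[str, List[float]] = {}
    for label, confs in scores.items():
        # Rank instances for this label: rank 1 = highest confidence.
        order = sorted(range(n_instances), key=lambda i: -confs[i])
        ranks = [0] * n_instances
        for rank, i in enumerate(order, start=1):
            ranks[i] = rank
        # Estimated number of test instances that carry this label.
        expected = priors[label] * n_instances
        normalized[label] = [r / expected for r in ranks]
    return [
        sorted(scores, key=lambda lab: normalized[lab][i])[:n_labels]
        for i in range(n_instances)
    ]


# Toy example: 4 test instances, 3 labels; priors are fractions of instances per label.
scores = {
    "Piano":  [0.9, 0.8, 0.7, 0.1],
    "Violin": [0.6, 0.2, 0.65, 0.3],
    "Oboe":   [0.5, 0.1, 0.2, 0.4],
}
priors = {"Piano": 0.75, "Violin": 0.75, "Oboe": 0.5}
predicted = informed_ranking(scores, priors)
```

For the first toy instance the raw confidences would select Piano and Violin, whereas the normalized ranks select Piano and Oboe, because 0.5 is the top score among the comparatively rare Oboe label.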

**Engineering the output**

As a final step, a post-processing filter was applied which disallowed instrument pairs that were present in the training set. In such cases, the second-ranked label was substituted by the next label which would not produce a label pair of the training set when combined with the first-ranked label. This substitution was based on the assumption that the classifier is more confident for the first-ranked label. The information for this filter was given in the description of the task by the contest organizers.
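A minimal sketch of such a filter, assuming the labels of a test instance are already sorted from most to least relevant. The function name and the fallback when every pairing was seen in training are illustrative assumptions.

```python
from typing import FrozenSet, List, Set, Tuple


def filter_training_pairs(ranked_labels: List[str],
                          training_pairs: Set[FrozenSet[str]]) -> Tuple[str, str]:
    """Keep the top label; pair it with the best label not forming a training pair."""
    first = ranked_labels[0]
    for candidate in ranked_labels[1:]:
        if frozenset((first, candidate)) not in training_pairs:
            return first, candidate
    # Assumed fallback: every pairing occurred in training, so keep the top two.
    return first, ranked_labels[1]


training_pairs = {frozenset(("Piano", "Violin"))}
pair_a = filter_training_pairs(["Piano", "Violin", "Oboe"], training_pairs)
pair_b = filter_training_pairs(["Violin", "Oboe", "Piano"], training_pairs)
```

Using `frozenset` makes the check independent of the order in which the two instruments are listed.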

**Some conclusions**

One interesting conclusion was that in multi-label learning problems, like the one of this contest, where modeling label correlations is not useful, combining simple multi-label learning techniques, such as Binary Relevance, with strong single-label learning techniques, such as Random Forest, can lead to better performance compared to state-of-the-art multi-label learning techniques. Another interesting conclusion was that it is better to use only mixture examples when pairs of instruments need to be recognized. An interesting direction for future contests would be the generalization of the task to the recognition of an arbitrary number of instruments playing together.

**Software used**

**References**

1. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
2. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Proceedings of ECML PKDD 2009, Bled, Slovenia, pp. 254–269 (2009)
3. Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering (2011)
4. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1088–1099 (2006)

A paper describing this solution in more detail will appear soon in the proceedings of ISMIS 2011.

Thanks for this competition – it was great fun. Software used: R, Weka, LibSVM, Matlab, Excel. This was the 2nd competition I had entered (the first being the SIAM biological one) and I only really entered because I had so much undergraduate marking to do! We developed a novel approach to the problem which involved multi-resolution clustering and Error Correcting Output Coding. Our 2nd place approach involved transforming the cluster labels into feature vectors.

Method and Journey:

1. We trained on 50% of the training data using Weka and built an ensemble of a cost-sensitive random forest (number of trees 100, number of features 25), a Bayes Net and a neural network. This resulted in 77.44% on the preliminary dataset. It was very frustrating as we couldn’t improve on this. We then looked at semi-iterative relabeling schemes such as Error Correcting Output Coding (using Matlab and LibSVM). This resulted in 81.59% prediction accuracy.

2. We then decided to look at the “statistics” of number of performers, segments, genres etc. We used R to normalize the data (training and test data) and to carry out K-means clustering, k =6 for genres, k=60 for performers, k=2000 for possible songs etc. Taking each set of clusters independently didn’t give any information. However, as we had pasted the results into the same file, we noticed a distinct pattern when the cluster results were looked at together – even though no crisp clusters were identified, we noticed that if a training instance was of a different genre from the rest of the cluster then it usually belonged to a different lower granularity cluster. We then built lots of cluster sets for the data (multi-resolution clustering). K was set to 6, 15, 20, 60, 300, 400, 600, 800, 900, 1050, 1200, 2000, 3000, 3200, 5000 and 7000 clusters. At the finest granularity cluster (k=7000) a majority cluster vote was taken using the training instance labels and the test set predictions – the whole cluster was relabelled to the “heaviest” class. If a cluster could not be converged at the finest k-level then we “fell back” to a lower granularity cluster (k=5000) and so on. These new predictions were fed back to the ECOC system and the process was repeated.

3. Figure below shows the overall approach we came up with:

4. This was the winning solution and resulted in a score of 0.87507 on the final test set. For the 2nd place solution, we decided to look at using the cluster assignment labels as feature vectors. This transformed the problem from the original 171-dimensional input space into a new 16-dimensional space, where each attribute was an identifier of the cluster at one of the 16 levels. So, for example, if instance #7 had fallen into the 3rd out of 6 clusters at the first granularity level, the 10th out of 15 clusters at the second granularity level and so on, in the transformed space it would be described as a 16-dimensional vector: [3, 10, …]. Note that these attributes are now categorical, with up to 7000 distinct values at the highest granularity level. This limited the number of classifiers we could use.
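The cluster voting with fall-back described in step 2 might be sketched as below. This is an interpretation, not the authors' code: cluster assignments are assumed to be precomputed (for example by K-means at each k), and "could not be converged" is read here as "no class exceeds an assumed share `min_share` of the cluster", in which case the instance falls back to the next coarser level.

```python
from collections import Counter, defaultdict
from typing import List, Sequence


def relabel_by_clusters(labels: Sequence[str],
                        assignments_fine_to_coarse: Sequence[Sequence[int]],
                        min_share: float = 0.5) -> List[str]:
    """Relabel each instance to the majority class of its finest decisive cluster."""
    new_labels = list(labels)
    resolved = [False] * len(labels)
    for assignment in assignments_fine_to_coarse:      # finest level first
        members = defaultdict(list)
        for idx, cluster_id in enumerate(assignment):
            members[cluster_id].append(idx)
        for idxs in members.values():
            # Votes always use the labels passed in, not the relabelled ones.
            top_label, top_count = Counter(labels[i] for i in idxs).most_common(1)[0]
            if top_count / len(idxs) > min_share:      # clear "heaviest" class
                for i in idxs:
                    if not resolved[i]:
                        new_labels[i] = top_label
                        resolved[i] = True
    return new_labels                                   # undecided instances keep their label


# Toy example with two levels (hypothetical cluster ids).
labels = ["rock", "rock", "jazz", "jazz", "rock", "jazz"]
fine = [0, 0, 0, 1, 1, 2]
coarse = [0, 0, 0, 1, 1, 1]
relabelled = relabel_by_clusters(labels, [fine, coarse])
```

In the toy data, the tied fine cluster 1 is undecided, so its two members take the majority label of the coarser cluster they fall into.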

Our classification system consisted of:

1. Random forest of 1000 unpruned C4.5 decision trees

2. Boosted ensemble of 10 C5.0 decision trees

3. Cross-trained ensemble of 100 Naive Bayes classifiers, trained on different subsets of attributes, each time selected using the Floating Forward Feature Selection method.

We have used majority voting to combine the decisions of these 3 ensembles. After labeling the test dataset using the method described above, we have fed both training and test dataset (this time with the labels from the previous step) to the ECOC system to obtain final predictions. This resulted in 0.87270 on the final test set.

— Amanda Schierz, Marcin Budka, Edward Apeh

I became interested in the ISMIS 2011 genres contest due to the challenge that some contestants noted in the online forum: standard model selection via cross-validation did not work well on the problem. Supervised learning techniques I tried, such as SVM, FDA, and Random Forest, all achieved accuracy in the 90-95% range in k-fold CV, only to result in leaderboard test set accuracy in the 70-76% range.

I interpreted this performance drop as an indication that the sample selection bias and resulting dataset shift was significant. I tried three categories of techniques in an attempt to produce a classifier that adapted to the test set distribution: standard transductive algorithms, importance weighting, and pseudo-labeling methods.

My best entry used what I call *Incremental Transductive Ridge Regression*. The procedure pseudo-labels test points progressively over multiple iterations in an attempt to gradually adapt the classifier to the test distribution. Labeled points can also be removed or reweighted over time to increase the significance of the unlabeled points. The objective function minimized in each iteration is the combination of a labeled loss term, a pseudo-labeled loss term, and the standard L_{2} ridge regularizer:
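One form consistent with that description, given purely as an illustration: it assumes squared loss, a linear weight matrix $W$, a labeled set $L$, pseudo-label responses $\hat{y}_j$, and hypothetical weights $C$ and $\lambda$, none of which are confirmed by the text.

```latex
\min_{W} \;
\sum_{i \in L} \bigl\lVert W^{\top} x_i - y_i \bigr\rVert_2^2
\;+\; C \sum_{j \in U_t} \bigl\lVert W^{\top} x_j - \hat{y}_j \bigr\rVert_2^2
\;+\; \lambda \, \lVert W \rVert_F^2
```

The first sum is the labeled loss, the second is the loss on the points pseudo-labeled so far ($U_t$), and the last term is the ridge penalty.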

The response vector *y_{i}* for each point contains

I experimented with several techniques for growing an initially empty *U_{t}* across

In the end, I was able to achieve 82.5% leaderboard accuracy using *T*=10, *T_{II}*=5,

Along the way, I also experimented with semi-supervised manifold algorithms like LapRLS [1] and tried importance weighting using uLSIF [2], but found only modest gains. Other pseudo-labeling techniques that produced around 80% accuracy for me were Large Scale Manifold Transduction [3] and Tri-training [4].

For implementation, I programmed in Python/SciPy and utilized the ‘scikits.learn’ package when experimenting with off-the-shelf classifiers. Reported results involve two pre-processing steps: duplicate entries in the data sets were removed and features were normalized to have zero mean and unit variance.

I would like to thank TunedIT, the members of Gdansk University of Technology, and any others who helped put together this challenging and fun event.

— Brian S. Jones

1. Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. *Journal of Machine Learning Research*.

2. Kanamori, T., Hido, S., & Sugiyama, M. (2009). A Least-squares Approach to Direct Importance Estimation. *Journal of Machine Learning Research*.

3. Karlen, M., Weston, J., Erkan, A., & Collobert, R. (2008). Large Scale Manifold Transduction. *Proceedings of the International Conference on Machine Learning*.

4. Zhou, Z.-H., & Li, M. (2005). Tri-training: exploiting unlabeled data using three classifiers. *IEEE Transactions on Knowledge and Data Engineering*.

The basis for my model was the Probabilistic Neural Network (PNN), which was originally introduced by Donald Specht in 1990. The PNN is an example-based classifier in which the ‘X’ vector for an unknown case to be classified is compared to all known-class cases used in the training set. A distance metric (typically Euclidean) is passed through a Gaussian function to estimate a ‘probability’ of a match with each training case. These individual probabilities are combined for each class in the training set, and the class with the highest composite probability is selected as the most likely class for the unknown case. The evaluation and combination function used was:
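For orientation, a textbook-style PNN can be sketched as follows. This is not necessarily the exact function used in the submission: the squared Euclidean distance, the kernel exp(-d²/(2σ²)), and the per-class averaging are assumptions, as are the function names and toy data.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def pnn_scores(x: Sequence[float],
               train_X: Sequence[Sequence[float]],
               train_y: Sequence[Hashable],
               sigma: float) -> Dict[Hashable, float]:
    """Average Gaussian kernel response of x against the training cases of each class."""
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for xi, yi in zip(train_X, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        sums[yi] += math.exp(-sq_dist / (2.0 * sigma ** 2))
        counts[yi] += 1
    return {c: sums[c] / counts[c] for c in sums}


def pnn_classify(x, train_X, train_y, sigma):
    """Return the class with the highest composite score."""
    scores = pnn_scores(x, train_X, train_y, sigma)
    return max(scores, key=scores.get)


train_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [0, 0, 1, 1]
near_origin = pnn_classify([0.2, 0.3], train_X, train_y, sigma=1.0)
far_corner = pnn_classify([5.5, 5.0], train_X, train_y, sigma=1.0)
flat = pnn_scores([0.2, 0.3], train_X, train_y, sigma=1000.0)
```

The last call shows the effect of an overly large σ: the class scores become almost identical, so the classifier can no longer tell the classes apart.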

Although a PNN can be used with little or no training, this problem posed several difficulties. The first was the high dimensionality of the input data. Because they are example-based, PNN classifiers require their input ‘space’ to be reasonably well filled in order to perform. As the number of input features is increased, one would expect the input space to become exponentially sparser. The solution to this was to employ feature selection. Another challenge for obtaining good performance is the proper selection of σ, which controls the selectivity of the classifier. If one makes σ too large, the classifier will tend to lose the ability to differentiate between different input data. On the other hand, if σ is made too small, the classifier loses the ability to generalize beyond its training set. The problems of both feature and σ selection were solved by using a guided random walk, with the objective of maximizing the Modified Youden performance on the training set. One feature of this approach is that it does not require the calculation of gradient information, only the value of the metric being maximized. To avoid severe overtraining effects, a leave-one-out scheme was used to evaluate training-set performance.

Because the PNN model developed as described above only sees a small subset of available inputs, I decided to attempt to increase the performance through constructing ensembles of the PNNs, and then taking a simple vote among their outputs to decide the final classification.

As one can see from the following plot, there is substantial variation in both the training and final test Modified Youden measures for different models, with a degree of correlation between the training metric and the final test metric. This led to the idea of constructing the final voting pool out of a subset of models with superior training performance.

In the end, the submission model consisted of a vote of the best 25 out of 135 candidate PNN models (by training score) constructed using 35 features. This yielded a training score of 0.794, and a final test score of 0.689. Note that while some individual sub-models would have had very similar performance to the ensemble model, there was no obvious way of reliably identifying such high-performing sub models *a priori*, so the ensemble technique allowed for the combination of a number of good (and not so good) models into a better one.

I developed the model generation code in Visual Basic .NET, and did the final vote taking using a spreadsheet. The generation and tuning of the 135 candidate models required nearly 8 hours on a single processor core of an Intel E5300.

— Ed Ramsden

QSAR data provided for SIAM SDM’11 Contest were known to be highly noisy. Around 30% of labels provided could be wrong due to experimental uncertainty, as reported by the organizers after the contest was closed. Furthermore, this contest only counted the last submission, which means it was risky to overtune the models on the known data (including training data and preliminary test data).

In my approach, a 7-fold cross-validation strategy was initially adopted for modeling on the training data. Several classification algorithms were tried and the best CV results (in terms of Balanced Youden Index) were observed with the R *gbm* and *randomForest* techniques. At that point the performance for *gbm* was 0.659/0.664/0.640 (in the order of 7-CV/preliminary/final), and for *rf* it was 0.636/0.718/0.628. (Of course, I only learned the final performance after the contest was closed.) I also tried different feature selection methods, but I did not see an obvious improvement, so I decided to use all of the 242 features.

The next step I tried was to remove noisy data. The assumption was that an instance is likely to be noisy if it is wrongly predicted with a high probability value. This idea was applied to the CV result of a balanced *gbm* model. If the prediction value for a positive instance was less than *nplimit*, it was assumed to be noise. Likewise, a negative instance was assumed to be noise if its prediction value was larger than or equal to *pplimit*. These noisy instances were removed and only the remaining instances were used for training the 2nd-round *gbm*/*randomForest* classifiers. After a few rounds of tuning, *nplimit* was set to 0.2 and *pplimit* to 0.8. Now I had the performance 0.688/0.644 (in the order of preliminary/final) for *gbm*, and 0.771/0.671 for *rf*.
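The filtering rule can be written down directly. The sketch below is in Python rather than R and assumes the cross-validated prediction values are already available as a list; the function name and toy values are illustrative.

```python
from typing import List, Sequence


def remove_suspected_noise(labels: Sequence[int],
                           cv_probs: Sequence[float],
                           nplimit: float = 0.2,
                           pplimit: float = 0.8) -> List[int]:
    """Return the indices of instances kept after dropping confidently wrong ones."""
    keep = []
    for i, (y, p) in enumerate(zip(labels, cv_probs)):
        noisy_positive = y == 1 and p < nplimit    # positive predicted as clearly negative
        noisy_negative = y == 0 and p >= pplimit   # negative predicted as clearly positive
        if not (noisy_positive or noisy_negative):
            keep.append(i)
    return keep


labels = [1, 1, 0, 0, 1]
cv_probs = [0.9, 0.1, 0.85, 0.3, 0.2]
kept = remove_suspected_noise(labels, cv_probs)
```

Note the asymmetry stated in the text: a positive at exactly *nplimit* is kept, while a negative at exactly *pplimit* is removed.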

Finally, the above process was applied to the combined training/preliminary data, with all modeling parameters unchanged from the first phase. In step 1, a balanced *gbm* model was built. In step 2, noisy instances were removed based on the step 1 CV result with *nplimit*=0.2 and *pplimit*=0.8. In step 3, an *rf* model was built for final classification. Since different *rf* runs give slightly different results, I actually built 9 *rf* models and took the majority vote as the final prediction, which ranked 2nd in this contest with a Balanced Youden Index of 0.6889.

There are two interesting observations. #1 the cut points for the final 9 *RF* models were between 0.207 and 0.213, which corresponds to the fact that “negatives outnumber positives approximately 4:1”. #2, the row numbers in the training/preliminary data of the 40 noisy instances in my final modeling were 11, 22, 45, 49, 74, 83, 104, 133, 196, 199, 268, 291, 322, 367, 368, 375, 378, 405, 412, 413, 415, 419, 454, 484, 515, 523, 535, 554, 606, 640, 675, 676, 680, 699, 729, 776, 791, 824, 825, 831. This might be useful for Simulations Plus to improve the experiments and labeling?

Thanks to the organizers for the opportunity to play with this interesting data.

Yuchun Tang, Ph.D.

Lead Principal Engineer

Global Threat Intelligence

McAfee Lab

http://www.linkedin.com/in/yuchuntang

http://www.mcafee.com/us/mcafee-labs/threat-intelligence.aspx

http://www.trustedsource.org

The safety of pharmaceutical and chemical products with respect to human health and the environment has long been a major concern for the public, regulatory bodies, and industry, and the demand for it is increasing. Safety considerations start in the early design phases of drugs and chemical compounds and end formally with official authorization by national and international regulators. For decades, animal tests have been used as the preferred accepted tool (a kind of gold standard, which in fact they are not) for testing harmful effects of chemicals on living species or the environment. Currently, in Europe alone, about 10 million animals per year are (ab)used for laboratory experiments, and a lot of time and billions of euros are spent on these experiments. So are we, as consumers who use and value chemical products every day in some form, safe? Not really. About 90% of the chemicals on the market today have never been tested or have never been officially required to be tested. The reason is simple: ethical issues aside, it is estimated that an additional 10 to 50 million vertebrate animals would be required if all 150,000 registered substances had to be tested in this traditional way, and it is simply not possible to run animal tests for this number of substances within reasonable time and cost constraints. To solve this problem, there is a strong demand for alternative testing methods, such as QSAR models, to help minimize and largely replace animal tests in the future.

Many QSAR models for various chemical properties and biological endpoints have been published, especially in the past 10 years. However, most of them have been developed from a purely scientific viewpoint, and it is not clear whether they are applicable for industrial and regulatory purposes. The current international research project ANTARES, funded by the European Commission (LIFE+ LIFE08 ENV/IT/000435), targets this problem. It searches for and evaluates published QSAR models for a large number of endpoints using a set of quality, transparency, and reliability criteria. These criteria are important for identifying models that can be used appropriately, and that are accepted by all parties involved, in official registration and authorization procedures for chemicals such as the ongoing European initiative REACH.

Predictive modeling of a biological activity from the molecular structure of chemical compounds can be seen as a complex, ill-defined modeling problem, which is characterized by a number of methodological problems:

- Inadequate a priori information about the system for adequately describing the inherent system relationships. Creating models for predicting harmful effects on human health and the environment is a highly interdisciplinary challenge. There is no domain knowledge available from any single domain that would solve the problem by theory.
- Possessing a large number of variables. A few hundred to a few thousand input variables are not uncommon in QSAR modeling.
- Noisy data and few data samples, in the range of tens to a few hundred.
- Vague and fuzzy objects whose variables have to be described adequately. Experimental toxicity data are the result of animal tests. Depending on the species used in an assay, its inherent bio-variability can be quite high and can vary greatly from species to species and from test to test. This translates into a huge amount of noise in the experimental data used to build QSAR models.

A powerful modeling technology that addresses these problems by its design is *Self-organizing Networks of Active Neurons* based on the Group Method of Data Handling. Built on the principles of self-organization, it inductively develops, starting from the simplest possible model, optimal complex models that are composed of sets of self-selected relevant inputs. In this way, it performs both parameter and structure identification of a model and it solves the basic problem of experimental systems analysis of systematically avoiding overfitted models based only on the information in the data. Furthermore, the models are available analytically in the form of linear or non-linear regression or difference equations. High-dimensional modeling from hundreds or thousands of input variables is another integrated part of Self-organizing Networks of Active Neurons that apply unique approaches to multilevel self-organization and noise immunity. This leads to the concept of self-organizing high-dimensional modeling, which hides the complex processes of knowledge extraction, model development, dimension reduction, variables selection, noise filtering for avoiding overfitted models, and model validation from the user as a condition

For the SIAM SDM’11 QSAR Challenge we used our general-purpose predictive modeling and data mining tool KnowledgeMiner (yX) for Excel out of the box. We also tested a new algorithm on cost-sensitive classification we are developing to see how it performs under real-world conditions. This algorithm also optimizes results of imbalanced class distributions as found in the challenge. The final solution submitted to the challenge is a model ensemble of two non-linear regression models obtained directly from the challenge data set of 837 samples and 242 descriptor variables. No prior dimension reduction, feature selection, data normalization or subdivision was used. All this is integrated in the knowledge extraction process of the tool.

This may sound like a very time-consuming modeling task, but since KnowledgeMiner (yX) is 64-bit parallel software, it actually is not. Self-organizing a model from the entire challenge data set takes about 1 to 5 minutes on a 3 GHz 8-core Mac Pro running Mac OS X 10.6.

The first model is composed of 14 and the other of 15 self-selected relevant molecular descriptors, which together form a set of 19 unique descriptors. The sensitivity of the final model on the design data is 0.714 and specificity is 0.707. On the out-of-sample challenge test data set this model shows a sensitivity and specificity of 0.711 and 0.677, respectively. All models can be exported to Excel for further use.

A free download of the software is available here.

I would like to thank the organizers for setting up this challenging task.

— Frank Lemke

*** * ***

**By Robert Wieczorkowski, Ph.D. ( rwieczor).**

The SIAM SDM’11 Contest was the second challenge in which I participated on TunedIT. Previously I took part in the IEEE ICDM Contest (Traffic Prediction) and finished in 12th place. Taking part in this challenge gave me new practical experience in using modern statistical tools.

I graduated from Warsaw University in Mathematics (1988) and completed a Ph.D. in Mathematics (1995) at the Institute of Mathematics, Warsaw University of Technology. For many years my area of expertise has been statistics and the application of sampling methods in official statistical surveys. I had no specific knowledge of the QSAR domain, so I simply decided to use available methods and algorithms implemented in R software packages.

First I selected important attributes from the training data using the *Boruta* package (which uses a random forest classifier); this resulted in 75 variables out of 242. After that I experimented with different classifiers from R packages using the 75 selected features, e.g. *randomForest*, *ada*, *gam*, *kknn*. The most promising results were obtained from *randomForest*, so from then on I mainly used this classifier.

The evaluation measure used in this challenge (based on specificity and sensitivity) raises the problem of choosing a proper cutoff point for the probability of belonging to a given class. The *roc* function from the *caret* package enabled me to approximate cutoff points, and graphs based on its output were also helpful. From simulations based on random samples from the training data I obtained a cutoff estimate of around 0.21.
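To make the cutoff search concrete, here is a minimal Python sketch. It is not the author's R code: it assumes the score being maximized is the Youden index (sensitivity + specificity − 1), and the labels and probabilities are made-up toy values.

```python
def youden_index(y_true, scores, cutoff):
    """Sensitivity + specificity - 1 when predicting positive for score >= cutoff."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= cutoff)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def best_cutoff(y_true, scores):
    """Try every observed score as a cutoff and keep the one with the highest index."""
    return max(sorted(set(scores)), key=lambda c: youden_index(y_true, scores, c))


# Toy data (illustrative only): 1 = positive class, 0 = negative class.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.02, 0.05, 0.08, 0.10, 0.15, 0.18, 0.22, 0.25, 0.40, 0.70]
cutoff = best_cutoff(labels, probs)
```

With imbalanced classes the best cutoff typically sits well below 0.5, which is consistent with the low value reported above.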

For a fixed cutoff value I also tried to optimize the parameters of the *randomForest* classifier, i.e. *ntree*, *mtry* and *classwt*. For given parameters I also generated many models trained on samples from the training data and combined them by majority voting.

After the preliminary test set was revealed, I applied a model that averages 100 random forest classifiers trained on independent samples drawn from a mixture of the training data (70%) and the preliminary test set (20%). It gave an accuracy of 0.6667 and 4th place on the leaderboard. The elementary classifier in my final model has the following parameters:

randomForest(ntree=50,mtry=30,classwt=c(6,1))

The challenge was a very interesting experience for me and I would like to thank the organizers and congratulate the winners.

— Robert Wieczorkowski

*** * ***

**By Yasser Tabandeh (TmY).**

I’m a master’s student of software engineering at Shiraz University, Iran. My experience with data mining began 2 years ago. I participated in the PAKDD 2010 and UCSD 2010 data mining competitions, and this is the 3rd time I have finished in 5th position in a data mining contest.

I didn’t have any prior knowledge of QSAR and used only data mining techniques for this challenge. In this contest, the number of instances was small in comparison to the number of features. So I didn’t make any submissions in the preliminary phase and waited for the labels of the preliminary instances to be revealed, to make my model stronger and prevent overtraining.

I used Weka for my experiments, with 5-fold cross-validation. The performance measure I used to evaluate my models was AUC, because I think it is closely related to the Youden Index. In my experiments, I found the best results with 3 different models, each using a different feature set:

- Logistic regression (ridge parameter: 1.0E-8)
- SMO (puk as kernel function, standardize data, build logistic models)
- KNN (K=5)

For the first two models, I used SVM-RFE to rank features and then backward elimination to select the final feature set. For the third model (KNN), I used my own Relief-based algorithm to weight features and, again, backward elimination to select the final feature set.

After creating the 3 models, I combined their scores to generate the final scores by averaging the probabilities of the 3 models. The best cut-off point on the training data was 0.5619, and I used this point for the final test set. After the final test labels were released, I found that the best cut-off point was lower than this; however, that would have been a very risky selection.
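The voting step can be sketched in a few lines of Python. This is only an illustration of averaging probabilities and applying a fixed cut-off; the three probability lists are invented, and only the cut-off value 0.5619 comes from the text.

```python
def average_vote(prob_lists, cutoff):
    """Average per-instance probabilities across models, then threshold them."""
    n_models = len(prob_lists)
    averaged = [sum(ps) / n_models for ps in zip(*prob_lists)]
    return averaged, [1 if p >= cutoff else 0 for p in averaged]


# Hypothetical class-1 probabilities from the three models for three instances.
logistic = [0.90, 0.40, 0.55]
smo = [0.80, 0.60, 0.50]
knn = [0.60, 0.80, 0.60]

scores, predicted = average_vote([logistic, smo, knn], cutoff=0.5619)
```

Averaging lets a confident model outvote two lukewarm ones, which a plain majority vote on hard labels would not allow.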

To summarize, I think this contest had 3 main challenges:

- Creating a good model without overfitting
- Feature selection
- Selecting cut-off point

I thank all the organizers for this contest.

— Yasser Tabandeh

*** * ***

**By Harris Papadopoulos, Ph.D. (hnp).**

My solution was based on a recently developed framework called Conformal Prediction (CP), which extends classical machine learning techniques so that they provide valid measures of confidence with their predictions. Although confidence measures were not the goal of this contest, I used this framework because a variation of CP, called Mondrian Conformal Prediction, can produce confidence measures that are valid for each class of a problem separately, which partly addresses the class imbalance problem of the contest data. Another reason was to explore the possibility of further work on applying CP to this particular application area, where it can be used to reject the cases for which a positive result is unlikely at some confidence level. Starting with a high confidence level and then gradually reducing it can lead to a form of ranking for the order in which to experimentally evaluate candidate compounds. Anyone interested can find a number of Conformal Predictors described in the literature, extending some of the most popular classification techniques, such as Support Vector Machines, Neural Networks, *k*‑Nearest Neighbours and Genetic Algorithms. The same goes for regression techniques, where CP has been coupled with Ridge Regression, *k*‑Nearest Neighbours and Neural Networks.

The approach followed included four preprocessing steps. First outliers were removed using a method developed from the main ideas of the *k*‑Nearest Neighbours CP. This method evaluates how well each example fits in with all the other examples in the dataset and removes the ones that do not seem to belong to the dataset as well as the rest. The second preprocessing step aimed at reducing the class imbalance of the dataset. This was done by removing the examples of the majority class that were very different from those of the minority class. The third step again addressed the class imbalance problem by using the Synthetic Minority Over-sampling Technique (SMOTE) to generate more examples of the minority class. Finally, feature selection was performed by applying the ReliefF technique. The first two steps were implemented in Matlab, while for the last two the WEKA data mining toolkit was used.

Twelve different combinations of outlier elimination thresholds and number of top features selected were used in the preprocessing steps, thus generating twelve different datasets. These were then used to train various Mondrian CPs, based on *k*‑Nearest Neighbours, Naïve Bayes and Neural Networks; two of the Mondrian CPs were ensembles based on the aforementioned algorithms. The final classifications were then generated based on the p-values produced by all dataset-CP combinations for each class, following a voting scheme. All Mondrian CPs and the scheme for combining their outputs to produce the final classifications were implemented in Matlab.
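For readers unfamiliar with Mondrian CP, the class-conditional p-value at its core can be sketched as follows. This is a generic Python illustration in the calibration-set (inductive) style, not the author's Matlab code. The write-up does not say which variant was used or how nonconformity scores were computed, so the scores below are simply assumed as given.

```python
def mondrian_p_value(calibration, candidate_label, candidate_score):
    """Class-conditional p-value for assigning candidate_label to a new example.

    calibration: list of (label, nonconformity_score) pairs with known labels.
    Only examples of the candidate class are compared, which is what makes the
    resulting confidence measure valid for each class separately.
    """
    same_class = [score for label, score in calibration if label == candidate_label]
    at_least_as_strange = sum(1 for score in same_class if score >= candidate_score)
    return (at_least_as_strange + 1) / (len(same_class) + 1)


# Toy calibration set: 3 minority-class (1) and 6 majority-class (0) examples.
calibration = [(1, 0.2), (1, 0.5), (1, 0.9),
               (0, 0.1), (0, 0.3), (0, 0.4), (0, 0.6), (0, 0.8), (0, 0.7)]

p_positive = mondrian_p_value(calibration, 1, 0.6)
p_negative = mondrian_p_value(calibration, 0, 0.75)
```

A label is rejected at confidence level 1 − ε when its p-value falls below ε, so lowering the confidence level step by step yields the kind of ranking mentioned above.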

— Harris Papadopoulos


**If you want to ask the authors any questions feel free to comment below the post.**

**By Łukasz Romaszko (lukasz21212121), the winner, from the University of Warsaw, Poland.**

The algorithm focuses on computing ranks for each street (the higher rank means greater probability of traffic jam) and the number of streets to be jammed. The highest ranked streets are given as the output. In particular it uses an adaptation of the k-Nearest Neighbors algorithm. The following description of the algorithm is simplified. The scheme presenting the algorithm idea is given below.

Computing the street ranks and the number *p* of jammed streets consists of several steps. Two ordered sets are chosen from the training data set: *Similarity* and *LengthSimilarity*. *Similarity* is used to compute street ranks, while *LengthSimilarity* is used to determine the number *p*. On the basis of the *Similarity* set and special functions, the algorithm generates an array *RankingList*. In the next step, *RankingList* is slightly modified by taking into consideration the locations of streets. The top *p* streets are given as the output.

*Algorithm*

Let us denote by *T* the test data: the set of identifiers of the 5 excluded segments followed by a sequence of major roads where the first jams occurred during initial 20 minutes of the simulation.

*Generating Similarity and LengthSimilarity ordered sets*

At the beginning, the algorithm compares sequences from the training data to *T*. Two different measures were used to compute *Similarity* and *LengthSimilarity*. Sequences are compared based on the positions (indices) of the roads that were jammed in both sequences. Let Δ be the difference between the length of the current sequence D and the length of T; |T| denotes the number of jammed roads in T. The measure used to generate *Similarity* assumes that sequences of similar length (small Δ) are more reliable. It is worth emphasizing that the metric used to generate the *Similarity* set has the greatest influence on the result.

Sequence evaluation in *LengthSimilarity* was similar to that in *Similarity*, but took into consideration only sequences with Δ <= 10. The algorithm for computing values in *Similarity* is described below:

    M := f * |T|;    { the best result was achieved with f = 1.10 }
    FOR each training data sequence D DO
        mS := 0;    { used to evaluate D in Similarity }
        FOR each identifier Di at position i in D DO
            IF exists Tj in T that Tj = Di THEN
                mS := mS + max(0, M - |i-j|);
        Similarity(D) := Weight(Δ)*(mS/|D|);
    { Sort Similarity and LengthSimilarity decreasingly. }
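The pseudocode translates fairly directly into Python. The sketch below is an illustration rather than the author's Java code. It assumes that road identifiers are unique within a sequence, that Δ is the absolute difference in lengths, and that `weight` is a hypothetical stand-in for the `Weight(Δ)` function, which the description does not define.

```python
def similarity(train_seq, test_seq, f=1.10, weight=lambda delta: 1.0):
    """Score how closely a training sequence of jammed roads matches the test sequence.

    train_seq, test_seq: lists of road identifiers in the order they jammed.
    weight: placeholder for Weight(delta); a constant 1.0 by default (assumption).
    """
    m = f * len(test_seq)
    position_in_test = {road: j for j, road in enumerate(test_seq)}
    ms = 0.0
    for i, road in enumerate(train_seq):
        j = position_in_test.get(road)
        if j is not None:
            # Roads jammed at similar positions in both sequences contribute more.
            ms += max(0.0, m - abs(i - j))
    delta = abs(len(train_seq) - len(test_seq))
    return weight(delta) * ms / len(train_seq)


test = [3, 7, 9, 4]
train = [7, 3, 4, 8, 9]
score = similarity(train, test)

# Training sequences would then be sorted by decreasing score.
ranked = sorted([train, [1, 2, 5]], key=lambda d: similarity(d, test), reverse=True)
```

Dividing by |D| penalizes long training sequences that match the test sequence only by listing many roads.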

*Analyzing Similarity and LengthSimilarity sets*

*Algorithm 1*: Roads Evaluation.

Streets which are jammed the most frequently in *Similarity* are thought to jam in the test case with the highest probability. The algorithm evaluates each street assuming that the sequences most similar to *T* are the most reliable. Moreover, streets jammed at initial positions have a greater influence on the street rank. The algorithm counts how often streets occurred in the top N best sequences in *Similarity*.

*Algorithm 2*: Length Prediction of the output sequence.

The value of *p* depends on the average sequence length in *LengthSimilarity*. The evaluation metric, MAP, rewards listing expected jammed streets even if they are listed after the expected total number of roads; therefore the output sequence should be longer than *averageLength* (by a factor of 1.5).

*Improvements and the output*

For each street *s*, if that street was evaluated to jam more or less frequently than it should (on average, based on local evaluation on tests created from part of the training data set), *RankingList*[*s*] was multiplied by a factor < 1.0 or > 1.0, respectively. *RankingList* was then sorted in decreasing order. Subsequently, for each street in *Result*, we compute *a*, the minimum distance to any other edge in *Result*. Connected edges have a greater probability of getting jammed, therefore they should be listed earlier. There are a few iterations during which the order of roads is changed slightly: street identifiers in the output are swapped if a street has a higher *a* value than the adjacent one. Eventually, the first *p* streets were given as the output.

*Results*

The training data set was split into two parts: a training part (80% of the whole data) and a validation part (20%). The parameters mentioned above were tuned independently by repeating the experiment and evaluating on the validation part. It was a very time-consuming process. The code was written in Java on the Windows XP operating system. Each execution for 5000 cases on a 2.8 GHz processor took six minutes (about 0.07 s per test case). This algorithm achieved the best result in the contest: 0.5598 points in the preliminary evaluation and 0.56056 in the final.

**By Benjamin Hamner (hamner)**

*Transforming GPS Coordinates to Road Segments*

An algorithm was developed to rapidly determine which road segment a car was on given its GPS coordinate. The roads in Warsaw were preprocessed by laying a grid of points over the map and determining which road segments lay within a certain radius. This meant that, given a GPS coordinate, only the distance to road segments near the four closest grid points to the GPS coordinate needed evaluation. The preprocessing was reiterated with finer grids of points.
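The grid idea can be sketched as below. This is a simplified, single-resolution Python illustration on planar coordinates, not the author's Matlab code. The function names, the grid step and the radius are all assumptions, and the refinement with finer grids is not reproduced.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def build_grid(segments, step, radius, nx, ny):
    """Preprocessing: for every grid node, list the segments within `radius` of it."""
    grid = {}
    for gx in range(nx + 1):
        for gy in range(ny + 1):
            node = (gx * step, gy * step)
            grid[(gx, gy)] = [k for k, (a, b) in enumerate(segments)
                              if point_segment_distance(node, a, b) <= radius]
    return grid


def nearest_segment(point, segments, grid, step):
    """Query: only check segments registered at the four grid nodes around the point."""
    gx, gy = int(point[0] // step), int(point[1] // step)
    candidates = set()
    for cell in ((gx, gy), (gx + 1, gy), (gx, gy + 1), (gx + 1, gy + 1)):
        candidates.update(grid.get(cell, []))
    if not candidates:
        return None
    return min(candidates, key=lambda k: point_segment_distance(point, *segments[k]))


# Three toy road segments on a 10 x 10 map.
roads = [((0.0, 0.0), (10.0, 0.0)), ((0.0, 10.0), (10.0, 10.0)), ((5.0, 0.0), (5.0, 10.0))]
index = build_grid(roads, step=5.0, radius=5.0, nx=2, ny=2)
first = nearest_segment((1.0, 1.0), roads, index, step=5.0)
second = nearest_segment((4.5, 6.0), roads, index, step=5.0)
```

The radius has to be large enough relative to the grid step, otherwise the true nearest segment may not be registered at any of the four surrounding nodes.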

Though there are surely faster algorithms, this worked – it allowed the 200 million GPS readings in the competition data set to be transformed to road segments in under two days, instead of several years for the brute force approach. No corrections were made for GPS coordinates outside of Warsaw, which likely had a slight negative impact on the results.

*Preprocessing*

Three sets of features were extracted from the raw and transformed data, representing the local and aggregate traffic flow. The first set represented aggregate traffic flow: a grid of 16 points was put on the map of Warsaw. The number of cars closest to each point in two categories (stopped and moving) was counted for each half-hour time window, providing 32 features. The second set of features represented the aggregate traffic flow as well and utilized the transformed data. Three high-dimensional matrices were formed: one with the counts of moving cars on each road segment in each time window, one with the counts of stopped cars, and one with the mean speed of the moving cars. A hacked-together version of Principal Component Analysis (PCA) that could run quickly on such high-dimensional matrices on a laptop was applied, and the first 12 principal components were taken for each matrix. This provided an additional 36 features.

The set of features for local traffic flow varied based on the edge being predicted. These features included the counts of moving and stopped cars along with the mean speed of moving cars on both the edge being predicted and the edges connected to it. This accounted for an additional 6-42 features, depending on the edge.

*Regression*

Two 100-tree random forests were trained for each of the 100 edges being forecast, one for the 31st-36th minute predictions, and one for the 54th-60th minute predictions. While some of the actual velocities had bimodal or trimodal distributions, the predictions were almost always unimodal (see the graph below).

To account for this, random forests were first trained to classify the data into different contexts if the velocity distributions were bimodal or trimodal, and then new random forests performed regression within each context. For example, on the edge in the above graph, a random forest was first trained to split the data into three groups: (1) likely having a speed < 20 km/hr, (2) likely having a speed from 20-60 km/hr, and (3) likely having a speed above 60 km/hr.

Making the regression context-dependent substantially improved the results. The graph below shows how parameterizations over subsets of the possible features affected the performance on the training portion of the dataset (split 50% train, 50% validation). For the aggregate counts, the parameter values 1-5 correspond to 4, 16, 64, 144, and 256 regions used to predict traffic flow. For the aggregate PCA model, the parameter values 1-6 correspond to 2, 4, 8, 12, 16, and 32 principal components used per matrix. Different parameters were not evaluated for models shown between parameter values 3 and 4.

*Computation*

Everything was done in Matlab on a 2.53 GHz MacBook Pro with 4 GB of RAM.

*Other Thoughts*

This was an entirely unprincipled approach to the problem, beyond “if it looks like it works, go with it.” This worked very well for the competition, and left a lot of room for future improvement. Tracking the movement of individual cars through time, extracting additional features, applying feature selection, and testing alternative regression and modeling techniques may improve results on the existing data set. Real-world data sets could have additional information that could enhance the predictive power of algorithms. These include data on the date, time, and weather, as well as traffic information from other data sources.

*** * ***

**By Andrzej Janusz (NosferatoCorp)**

In this task, data consisted of a large number of GPS logs from cars driving through the streets of Warsaw acquired in a number of simulations. Road map of Warsaw was represented as a street graph. The objective of the task was to predict average velocities of cars passing through 100 selected edges.

The dataset was preprocessed prior to further analysis. For each of 50 training and 500 test simulations, the cars which passed through selected edges were extracted. The training simulations were divided into disjoint 6 minute time windows, corresponding to the available information about the true average velocity values at the selected edges, after 6 and 30 minutes time stamps. The time windows were described by 3×100 attributes expressing average velocities of cars passing through each of 100 selected edges, edges that end at the starting nodes of the selected edges and those which begin at the ending nodes of the edges from the task. Values of those attributes were computed using the filtered GPS logs.

A decision table was constructed in order to create a predictive model. For each simulation every 5 consecutive time windows were merged. Those descriptions of 30 minute periods were treated as objects. The training data contained 4550 such instances (each described by 5×300 attributes and having 200 target decision values) and the test set consisted of 500 instances. All the missing values were linearly interpolated.

The k-Nearest Neighbors regression model was used for making predictions. The standard algorithm was modified by introducing additional sets of weights of objects assigned to each of the target decision attributes. The weights expressed the average impact of particular objects at the squared error of predictions. They were computed during the initial cross-validation on the training set (a leave-one-out technique was used to minimize the temporal bias).

After the weights were computed, a second cross-validation was performed, in which irrelevant attributes were filtered out (a standard correlation-based filter was used) and the *k* value was tuned (separately for each of the target decisions). Finally, the predictions were made. For every test instance, the *k* nearest neighbors were selected (in the Manhattan metric) and a weighted average of their decision values was taken. The predictions were additionally weighted based on the distances of the neighbors from the tested objects. A standard triweight kernel was utilized. The model was implemented in R using the “kknn” package.
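The prediction step for a single target can be sketched in plain Python as follows. This is an illustration, not the author's R code: the extra per-object weights learned in the first cross-validation are left out, and scaling distances by the (k+1)-th neighbor is an assumption made in the spirit of kernel-weighted k-NN.

```python
def triweight(u):
    """Triweight kernel: (35/32) * (1 - u^2)^3 on |u| < 1, zero elsewhere."""
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) < 1.0 else 0.0


def knn_predict(train_x, train_y, query, k):
    """Kernel-weighted k-NN regression with the Manhattan metric."""
    distances = sorted(
        (sum(abs(a - b) for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    neighbours = distances[:k]
    # Scale by the (k+1)-th distance so the k nearest fall inside the kernel support.
    scale = distances[k][0] if len(distances) > k else distances[-1][0]
    weights = [triweight(d / scale) if scale > 0 else 1.0 for d, _ in neighbours]
    total = sum(weights)
    if total == 0.0:
        # All neighbours sit on the kernel boundary: fall back to a plain mean.
        return sum(y for _, y in neighbours) / len(neighbours)
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / total


# Toy data: two attributes per instance, one target velocity.
train_x = [(0, 0), (1, 0), (0, 2), (4, 4)]
train_y = [10.0, 20.0, 30.0, 100.0]
prediction = knn_predict(train_x, train_y, query=(0, 0), k=2)
```

In the toy example the exact match dominates, so the prediction stays much closer to 10 than to 20.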

*** * ***

**By Amr Kabardy and Nayer Wanas (team amrkabardy), 5th place.**

Traffic is repetitive in nature, both spatially and temporally, and in response to certain patterns and events. Given the data provided and this observation, the approach uses the leading 30 minutes in a given segment to identify the most similar 30-minute pattern of velocities in a segment with the same profile. The profile of a segment is based on the number of lanes and the estimated throughput. The cosine similarity of the estimated velocities is used as the similarity measure. Moreover, it is weighted based on the temporal and spatial separation between the segments. Estimating the velocity includes smoothing the reported instantaneous velocities, identifying the most representative readings, removing outliers, and averaging to produce an overall velocity for the segment at a given time. Where suitable data is lacking, the ratio of the estimated velocity to the maximum velocities of inbound and outbound segments is used to estimate the velocity. It is worth mentioning that smoothing the reported velocities by averaging over the 50 cycles produces a better approximation, which in turn supports the original assumption.
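The core matching step might look like the following Python sketch. It shows only the plain cosine similarity between velocity patterns; the weighting by temporal and spatial separation and the segment profiles are omitted, and the velocity values are invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two velocity patterns of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def most_similar_pattern(leading, history):
    """Index of the historical pattern most similar to the leading pattern."""
    return max(range(len(history)), key=lambda i: cosine_similarity(leading, history[i]))


# Hypothetical velocity patterns (km/h), one value per time slot.
leading = [30.0, 28.0, 25.0]
history = [[10.0, 40.0, 5.0], [60.0, 56.0, 50.0], [5.0, 5.0, 50.0]]
best = most_similar_pattern(leading, history)
```

Cosine similarity compares the shape of a pattern rather than its magnitude, so a segment that is twice as fast but slows down in the same way still counts as a close match.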

*** * ***

**We thank all the authors for their descriptions and wish good luck in next TunedIT competitions!**


**Today, we publish descriptions for Task 1, “Traffic”. In the coming days we’ll make another post with Task 2 and 3 reports – stay tuned! We thank all the authors for their contributions.**

**If you want to ask the authors any questions, feel free to comment below the post.**

*** * ***

**By Alexander Groznetsky ( alegro), the winner. Alex is an experienced data miner who had participated**

My solution was a linear mixture of about 20 predictors, based on three types of algorithms used with different parameters:

- Linear Least Squares (LLS),
- Supervised SVD-like factorization (Singular Value Decomposition),
- Restricted Boltzmann Machine (RBM) neural network.

The first one was based on a weighted linear regression model. One set of regression parameters was computed for each target value (summary congestion at minutes 41-50 at a road segment). Known target values were used as regressands. Congestion values averaged per segment were used as regressors. Regression weights (one per design matrix row) were computed as the product of similarity and the time distance from the target to the regressors' averaging intervals. Only a limited number of the neighbors most similar to the predicted one was used for modeling. This predictor produced several predictions with different averaging intervals, different numbers of selected neighbors, and neighbors either aligned or not aligned on an hour boundary.

The similarity measure used in weighting and selection was computed as follows:

S = P^a / E^b ,

where P, E – Pearson correlation coefficient and Euclidean distance between known congestion values; a, b – constants, in most cases a = 4, b = 6 were used.
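As a small illustration, the measure can be computed in plain Python as below. The formula and the default constants a = 4, b = 6 come from the text; the function names and the congestion vectors are made up. Note that the expression is undefined for identical vectors (E = 0) and for constant ones (undefined correlation), which a real implementation would have to handle.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def similarity(x, y, a=4, b=6):
    """S = P^a / E^b: high correlation and small distance give a large value."""
    return pearson(x, y) ** a / euclidean(x, y) ** b


# Hypothetical known congestion values for three simulation windows.
target = [1.0, 2.0, 3.0, 4.0]
close = [2.0, 3.0, 4.0, 6.0]
far = [5.0, 1.0, 7.0, 0.0]
s_close = similarity(target, close)
s_far = similarity(target, far)
```

With exponents this large, the measure falls off very steeply, so only a handful of the closest neighbors receive a noticeable weight.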

The second predictor was based on a low-rank SVD-like factorization of the congestion values in accordance with the matrix equation C = LR, where C is the matrix of weighted congestion values averaged over 10 minutes and centered by subtracting row and column means (some values of C are missing), and L and R are the learned left and right factor matrices. The dimensions were 2000×120 for C, 2000×16 for L and 16×120 for R. L and R were initialized with random values and learned by gradient descent steps. Final predictions were made by solving a weighted linear least squares problem defined by L and the similarity measures. To produce cross-validated (CV) prediction values, L and R were learned once for each CV fold.
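A toy version of such a factorization with missing entries is sketched below in pure Python. It is not the author's implementation: the learning rate, the number of passes, the per-entry (stochastic) update order and the tiny rank-1 example are all assumptions, and the centering, weighting and final least squares step are left out.

```python
import random


def factorize(C, rank, steps=2000, lr=0.02, seed=0):
    """Learn L (rows x rank) and R (rank x cols) so that L*R fits the observed entries of C.

    Missing entries of C are marked with None and are skipped during training.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(C), len(C[0])
    L = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    R = [[rng.uniform(-0.1, 0.1) for _ in range(n_cols)] for _ in range(rank)]
    observed = [(i, j) for i in range(n_rows) for j in range(n_cols) if C[i][j] is not None]
    for _ in range(steps):
        for i, j in observed:
            err = C[i][j] - sum(L[i][k] * R[k][j] for k in range(rank))
            for k in range(rank):
                l_ik, r_kj = L[i][k], R[k][j]
                # Gradient step on the squared error of this single entry.
                L[i][k] += lr * err * r_kj
                R[k][j] += lr * err * l_ik
    return L, R


def reconstruct(L, R, i, j):
    """Entry (i, j) of the product L*R."""
    return sum(L[i][k] * R[k][j] for k in range(len(R)))


# Rank-1 toy matrix (outer product of [1, 2, 3] and [1, -1, 2]) with one hidden entry.
C = [[1.0, -1.0, 2.0],
     [2.0, -2.0, None],
     [3.0, -3.0, 6.0]]
L, R = factorize(C, rank=1)
filled_in = reconstruct(L, R, 1, 2)  # the hidden value is 4.0 in the full matrix
```

Because the loss only touches observed entries, the learned factors can be used afterwards to fill in the missing ones.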

The third predictor was based on a Restricted Boltzmann Machine (RBM) neural network model. An RBM with 1536 hidden units and ~58k visible units was used. A conditional Bernoulli distribution was used to model the hidden unit states. Softmax units were used as visible units. Each visible unit corresponds to one automobile. A vector of visible unit values corresponds to one hour of the simulation. The RBM was trained by the contrastive divergence method. Units with missing values did not contribute to the energy function of the network. Predictions were made by the mean field update method.

About 20 predictions were produced by the described predictors and mixed by linear mixer with non-negative weights. The weights were computed separately for each of the road segments in accordance with the predicted CV values and some amount of similarity information. 100-fold CV was used in all predictions except RBM based (one CV fold per one simulation cycle). Two-fold CV was used in RBM based predictions.

I did not optimize the algorithms for computational efficiency. For example, several training passes (for cross-validation) of the RBM with a 117k x 1.5k weight matrix over 2M sample vectors (2k vectors x 1000 epochs) were performed. Each pass costs more than 2.8 petaFLOPs (117k*1.5k*2M*4*2 = 2.808e15) in matrix-matrix multiplications alone. My algorithms (the RBM and the calculation of similarity) were implemented and executed on a GPU, so each pass took about 6 hours. On a standard quad-core CPU the same calculations would take 40-60 hours for every single pass.

*** * ***

**By Carlos J. Gil Bellosta (datanalytics)**

The “Traffic” problem was described and discussed in depth elsewhere. But there is a main aspect to it that should be stressed here: it was required to make estimates of a multidimensional (length 20) vector. Most data mining methods, either published in books or implemented in software packages, only predict scalar responses.

After a brief visual inspection of the data, I started by trying to set things up properly so as to save myself precious time at later steps, that is:

- Choose a data analysis tool that would provide me with the required flexibility, that I could possibly run in the cloud if need be, etc. R suited my needs perfectly.
- Create a good set of examples. I decided to group data into 10 minute slots and I could build a database of over 5000 independent samples.
- Build something like an abstract model which could combine any set of 20 models that fit unidimensional responses of my choice into a single one adapted to my multidimensional problem.
- Build a framework that would allow me to automate the selection of training and model validation, model fitting, and the creation of datasets to be submitted to the contest organizers. After this was achieved, testing a new model would be a matter of plug and play!

Book examples of highly parsimonious statistical methods (regressions, etc.) did not seem to provide satisfactory error rates. It was readily clear that the solution was much more local. The models that seemed to have a better behavior were those which retained a much higher amount of information from the training datasets: Random Forests, k-Nearest Neighbors, and the like.

Besides, I realized that not all these models mined the same wells of information in the training data. Some would ignore correlation structure among features, others would only pay attention to very narrow environments of new cases for prediction, etc.

This is why my final model was built as a *convex combination* (the one that minimizes global error) of the three best models I had obtained so far; it was superior to any individual model. Error convexity means that inferior models can still help improve better models!
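To illustrate the last point, here is a small Python sketch that searches for the best convex combination of three models on a coarse grid of weights. It is not the author's R code: the grid search is a deliberately simple stand-in for a proper optimizer, and the predictions are invented.

```python
import math


def rmse(pred, truth):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def best_convex_weights(predictions, truth, grid=20):
    """Search weights (w1, w2, w3), all >= 0 and summing to 1, on a regular grid."""
    best = None
    for i in range(grid + 1):
        for j in range(grid + 1 - i):
            w = (i / grid, j / grid, (grid - i - j) / grid)
            blended = [sum(wk * p[n] for wk, p in zip(w, predictions))
                       for n in range(len(truth))]
            score = rmse(blended, truth)
            if best is None or score < best[0]:
                best = (score, w)
    return best


# Hypothetical validation data: two models with opposite biases and a noisier third.
truth = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 22.0, 32.0, 42.0]
model_b = [8.0, 18.0, 28.0, 38.0]
model_c = [10.0, 25.0, 25.0, 40.0]
best_error, best_weights = best_convex_weights([model_a, model_b, model_c], truth)
```

In this toy case two equally biased models cancel each other out, so the blend beats each of them even though neither is better than the other on its own.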

*** * ***

**By Benjamin Hamner (hamner)**

Preprocessing: Three training data sets were considered:

*Set A* – 1000 data points, corresponding to the first half-hour of each hour-long window in the training set.

*Set B* – 11,000 data points, Set A + shifts in the training window in one-minute increments for 10 minutes.

*Set C* – 55,000 data points, Set B + all additional possible shifts.

Set A had the advantage that it did not contain redundant data and that it was drawn from the same distribution as the test set. Sets B and C involved augmenting the data with additional points, but at the cost of potentially shifting the distribution of the training set away from that of the test set.

The mean RMSE of each of these sets on the training data set was evaluated (split 50% training, 50% validation, 100-tree random forests, 15-minute downsampling window). The results were:

*Set A* – 24.62

*Set B* – 22.64

*Set C* – 22.59

The ATR time series was downsampled by summing the counts of cars over the road segments in consecutive intervals. The graph below shows how the width of the window affected performance within the training set (split 50% training, 50% validation, 12-tree Random Forests on Set C).
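The downsampling step amounts to summing per-minute counts over consecutive, non-overlapping windows. A minimal Python sketch, using invented counts for one road segment:

```python
def downsample(counts, window):
    """Sum per-minute car counts over consecutive non-overlapping windows."""
    return [sum(counts[start:start + window])
            for start in range(0, len(counts) - window + 1, window)]


# Hypothetical per-minute counts for one road segment over a half-hour.
per_minute = list(range(1, 31))
quarter_hours = downsample(per_minute, 15)
ten_minutes = downsample(per_minute, 10)
```

A wider window gives fewer, less noisy features per segment, while a narrower one keeps more of the temporal detail.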

Regression: 100-tree random forests were used for regression on the training data sets.

Ensembles of Random Forests: The most successful ensemble was:

20% – Set B, 15-minute downsampling window (Validation score: 24.829)

40% – Set C, 10-minute downsampling window (Validation score: 24.637)

40% – Set C, 15-minute downsampling window (Validation score: 24.595)

Ensemble – Validation score 24.411, Test score 25.337

*** * ***

**By Vladimir Nikulin (team UniQ)**

With the following approach we obtained a preliminary (leaderboard) result of 25.8. As the target variable we took the sum of the cars for the period between the 41st and 50th minutes. We used 600 features, which are the numbers of cars between the 1st and 30th minutes for the 20 given segments. Then we produced 2 solutions using the randomForest and GBM packages in R, which were linked together by an ensemble constructor as described in V. Nikulin: “Web-mining with Wilcoxon-based feature selection, ensembling and multiple binary classifiers” (see the web-site of the PKDD2010 Discovery Challenge). A further improvement to 24.502 was achieved by introducing 3 regulation parameters: p1 = 6 was used for smoothing of the features (moving averages), p2 = 8 is the time interval (in minutes) between known and predicted data, and p3 = 12 is the smoothing parameter for the predicted time interval. Accordingly, the number of features was reduced to 500, and as the target we used the sum of cars for the period between the 39th and 50th minutes. Cross-validation with 10 folds was used to optimise the values of all regulation parameters.

Remark: The parameter p2 represents a novel compromise between training and test data: by making p2 smaller we improve the quality of prediction (moving the target closer to the training data). On the other hand, we cannot go too far from the required test time interval. By making p3 larger than 10 we simplify the task. However, p3 cannot be too big; otherwise, the task will suffer from over-smoothing. It is another compromise.

*** * ***

**We thank all the authors for their descriptions and congratulate them on achieving top scores!**

** **

Task 1 – TrafficPreprocessing: Three training data sets were considered:Set A – 1000 data points, corresponding to the first half-hour of each hour-long window in the training set.

Set B – 11,000 data points, Set A + shifts in the training window in one-minute increments for 10 minutes.

Set C – 55,000 data points, Set B + all additional possible shifts.Set A had the advantage that it did not contain redundant data, and that it was drawn from the same distribution of the test set. Sets B and C involved augmenting the data with additional points, but at the cost of potentially shifting the distribution of the training set away from that of the test set.The mean RMSE of each of these sets on the training data set was evaluated (split 50% training, 50% validation, 100-tree random forests, 15-minute downsampling window). The results were:

Set A – 24.62

Set B – 22.64

Set C – 22.59
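The one-minute shifts behind Set B can be illustrated with a short Python sketch (the original work was done in Matlab). It assumes a single 60-minute series of per-minute counts, a 30-minute feature window, and a 10-minute target that starts 10 minutes after the feature window ends; those offsets and the function name are assumptions made for illustration, chosen so that one hour-long window yields the 11 examples implied by 1000 versus 11,000 data points.

```python
def shifted_examples(series, feature_len=30, gap=10, target_len=10, max_shift=10):
    """Build (features, target) pairs by sliding the window one minute at a time.

    shift = 0 reproduces the unshifted example (as in Set A); larger shifts
    add the augmented examples (as in Set B).
    """
    examples = []
    for shift in range(max_shift + 1):
        target_start = shift + feature_len + gap
        target_end = target_start + target_len
        if target_end > len(series):
            break  # the shifted target would run past the end of the window
        features = series[shift:shift + feature_len]
        target = sum(series[target_start:target_end])
        examples.append((features, target))
    return examples
```

Each shifted example overlaps heavily with its neighbours, which is the redundancy and potential distribution shift mentioned for Sets B and C.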

The ATR time series was downsampled by summing the counts of cars over the road segments in consecutive intervals. The graph below shows how the width of the window affected performance within the training set (split 50% training, 50% validation, 12-tree Random Forests on Set C).

Regression: 100-tree random forests were used for regression on the training data sets.

Ensembles of Random Forests: The most successful ensemble was:

20% – Set B, 15-minute downsampling window (Validation score: 24.829)

40% – Set C, 10-minute downsampling window (Validation score: 24.637)

40% – Set C, 15-minute downsampling window (Validation score: 24.595)

Ensemble – Validation score 24.411, Test score 25.337
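A weighted average is the natural reading of the 20% / 40% / 40% ensemble above, and it can be sketched in a few lines of Python. The weights come from the text; the function names are illustrative, and the sketch is not the original Matlab code.

```python
import math


def blend(predictions, weights):
    """Weighted average of several models' predictions, element by element."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is needed per model")
    total = sum(weights)
    n = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n)
    ]


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```

Given three lists of validation predictions, `blend([set_b_15, set_c_10, set_c_15], [0.2, 0.4, 0.4])` produces the ensemble prediction, which can then be scored with `rmse` against the held-out targets; the three list names here are placeholders.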

Task 3 – GPS

Transforming GPS Coordinates to Road Segments: An algorithm was developed to rapidly determine which road segment a car was on given its GPS coordinate. The roads in Warsaw were preprocessed by laying a grid of points over the map and determining which road segments lay within a certain radius. This meant that, given a GPS coordinate, only the distance to road segments near the four closest grid points to the GPS coordinate needed evaluation. The preprocessing was reiterated with finer grids of points.

Though there are surely faster algorithms, this worked – it allowed the 200 million GPS readings in the competition data set to be transformed to road segments in under two days, instead of several years for the brute force approach. No corrections were made for GPS coordinates outside of Warsaw, which likely had a slight negative impact on the results.
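The grid lookup described above can be sketched as follows, in Python rather than the Matlab used for the competition. The sketch assumes planar (already projected) coordinates, straight segments given by their two endpoints, and a single grid level; the repeated refinement with finer grids is not reproduced, and all names are illustrative. The answer is only guaranteed to be the true nearest segment when the radius is large relative to the grid spacing.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a and b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def build_grid_index(segments, x_min, y_min, nx, ny, spacing, radius):
    """For every grid point, list the segments lying within `radius` of it."""
    index = {}
    for i in range(nx):
        for j in range(ny):
            node = (x_min + i * spacing, y_min + j * spacing)
            index[(i, j)] = [
                k for k, (a, b) in enumerate(segments)
                if point_segment_distance(node, a, b) <= radius
            ]
    return index


def nearest_segment(p, segments, index, x_min, y_min, spacing):
    """Check only segments registered at the four grid points around p."""
    i = math.floor((p[0] - x_min) / spacing)
    j = math.floor((p[1] - y_min) / spacing)
    candidates = set()
    for di in (0, 1):
        for dj in (0, 1):
            candidates.update(index.get((i + di, j + dj), []))
    if not candidates:
        return None  # e.g. a reading outside the gridded area
    return min(candidates, key=lambda k: point_segment_distance(p, *segments[k]))
```

The saving comes from the query step: each GPS reading is compared with a handful of nearby segments instead of every road segment in the city.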

Preprocessing: Three sets of features were extracted from the raw and transformed data, representing the local and aggregate traffic flow. The first set represented aggregate traffic flow: a grid of 16 points was put on the map of Warsaw. The numbers of cars closest to each point in two categories (stopped and moving) were counted for each half-hour time window, providing 32 features. The second set of features represented the aggregate traffic flow as well and utilized the transformed data. Three high-dimensional matrices were formed: one with the counts of moving cars on each road segment in each time window, one with the counts of stopped cars, and one with the mean speed of the moving cars. A hacked-together version of Principal Component Analysis (PCA) that could run quickly on such high-dimensional matrices on a laptop was applied, and the first 12 principal components were taken for each matrix. This provided an additional 36 features.

The set of features for local traffic flow varied based on the edge being predicted. These features included the counts of moving and stopped cars along with the mean speed of moving cars on both the edge being predicted and the edges connected to it. This accounted for an additional 6-42 features, depending on the edge.

Regression: Two 100-tree random forests were trained for each of the 100 edges being forecast, one for the 31st-36th minute predictions, and one for the 54th-60th minute predictions. While some of the actual velocities had bimodal or trimodal distributions, the predictions were almost always unimodal (see the graph below).

To account for this, random forests were first trained to classify the data into different contexts if the velocity distributions were bimodal or trimodal, and then new random forests performed regression within each context. For example, on the edge in the above graph, a random forest was first trained to split the data into three groups: (1) likely having a speed < 20 km/hr, (2) likely having a speed from 20-60 km/hr, and (3) likely having a speed above 60 km/hr.
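The classify-then-regress idea can be sketched in Python as a small wrapper. To keep the example self-contained, a nearest-centroid classifier and a mean predictor stand in for the random forests used in the actual solution; both stand-ins, the class names, and the handling of the band edges at exactly 20 and 60 km/hr are assumptions made for illustration.

```python
from statistics import mean


def speed_context(speed_kmh):
    """Map a speed to one of the three bands from the example above."""
    if speed_kmh < 20:
        return 0
    if speed_kmh <= 60:
        return 1
    return 2


class NearestCentroidClassifier:
    """Stand-in classifier (the original used a random forest)."""

    def fit(self, X, labels):
        groups = {}
        for row, label in zip(X, labels):
            groups.setdefault(label, []).append(row)
        self.centroids = {
            label: [mean(col) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return [min(self.centroids, key=lambda c: dist(row, self.centroids[c]))
                for row in X]


class MeanRegressor:
    """Stand-in regressor (the original used a random forest)."""

    def fit(self, X, y):
        self.value = mean(y)
        return self

    def predict(self, X):
        return [self.value for _ in X]


class ContextRegressor:
    """First predict the context, then regress within that context."""

    def __init__(self, context_of, make_classifier, make_regressor):
        self.context_of = context_of
        self.make_classifier = make_classifier
        self.make_regressor = make_regressor

    def fit(self, X, y):
        contexts = [self.context_of(v) for v in y]
        self.classifier = self.make_classifier().fit(X, contexts)
        self.regressors = {}
        for c in set(contexts):
            rows = [x for x, k in zip(X, contexts) if k == c]
            targets = [v for v, k in zip(y, contexts) if k == c]
            self.regressors[c] = self.make_regressor().fit(rows, targets)
        return self

    def predict(self, X):
        contexts = self.classifier.predict(X)
        return [self.regressors[c].predict([row])[0]
                for row, c in zip(X, contexts)]
```

Because each regressor only sees targets from one band, its predictions stay inside that band instead of being pulled toward the overall mean, which is how a multimodal speed distribution can be reproduced.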

Making the regression context-dependent substantially improved the results. The graph below shows how parameterizations over subsets of the possible features affected the performance on the training portion of the dataset (split 50% train, 50% validation). For the aggregate counts, the parameter values 1-5 correspond to 4, 16, 64, 144, and 256 regions used to predict traffic flow. For the aggregate PCA model, the parameter values 1-6 correspond to 2, 4, 8, 12, 16, and 32 principal components used per matrix. Different parameters were not evaluated for models shown between parameter values 3 and 4.

Computation: Everything was done in Matlab on a 2.53 GHz MacBook Pro with 4 GB of RAM.

Other Thoughts: This was an entirely unprincipled approach to the problem, beyond “if it looks like it works, go with it.” This worked very well for the competition, and left a lot of room for future improvement. Tracking the movement of individual cars through time, extracting additional features, applying feature selection, and testing alternative regression and modeling techniques may improve results on the existing data set. Real-world data sets could have additional information that could enhance the predictive power of algorithms. These include data on the date, time, and weather, as well as traffic information from other data sources.

*Magdalena Pancewicz*, TunedIT: How did it happen that data mining became your field of interest?

*Vladimir Nikulin*: I’ve got a PhD in Mathematical Statistics from the Moscow State University. I worked in Russia as a scientific programmer and then as an Associate Professor in Mathematics & Statistics. When I came to Australia in 1993, I had a very solid academic background, and it wasn’t difficult for me to study the most advanced methods of computational statistics and data mining (DM). The major progress in this area I made after joining the Australian National University in Canberra, where I worked for over 2 years, publishing papers and attending several international conferences. Also, I participated in data mining competitions, and won my first challenge in 2007: *Agnostic Learning vs. Prior Knowledge Challenge*.

Theoretical statistics, which I studied originally, has very limited applicability compared to data mining, because of idealistic models and assumptions which are too general. I think that data mining and computational statistics [don’t confuse with *theoretical* statistics – eds.] are among the fastest growing and most promising scientific areas, because high technologies are generating huge databases, which can’t be handled manually. That’s why I went into DM.

**What is, in your opinion, the future of data mining?**

There are no alternatives to data mining. This is not a theoretical area, so it is not about proving some sort of theorems. There are many datasets and many methods of data mining, and the most important thing is to find the right match: to use the proper methods on a particular dataset, in the proper way.

**How to do this?**

Be flexible. There are many regulation parameters, and, I think, the key in selection and evaluation of algorithms is cross-validation or leave-one-out method, which is applicable in the case of small sample sizes. Leave-one-out may be regarded as a perfect form of cross-validation. Today at the RSCTC conference I gave a talk about my methods. We use a very flexible framework which is based on cross-validation, where we are free to select the most suitable feature selection and classification methods. Using this framework we can make the right choice of the models and their parameters. Besides, the models themselves are very flexible and dependent on many regulation parameters.

Running cross-validation requires a lot of computational time, so we use a powerful machine: a monstrous Linux workstation at the University of Queensland. In many cases this station is working constantly, including overnight and on weekends. It is a multi-processor computer, which produces large amounts of numerical results. It’s necessary to maintain all the records very carefully. Basically, I’m recording the results in a proper form including many graphical illustrations, and based on these tables/graphs I’m doing the analysis: which particular models are better, which particular parameter settings are better. Typically, I come to work in the morning and collect new results produced overnight, which give me new directions where to go next. I give the computer new settings, and the cycle repeats again and again, until the deadline of the contest.

**How long does it usually take to find a solution?**

In some cases the solution itself may be very simple, but difficult to discover. In data mining competitions time is very limited. Consider, for example, last year’s ICDM competition: the task was very complex, clustering of three-dimensional curves. When you scan the brain, it is represented in computer memory as a quite big three-dimensional image, about 0.5 GB in size. It is necessary to group all these fibers into clusters, into bundles, as they call them. So it is a very difficult task but, again, why did I win this competition? I just found the right direction at one of the earliest steps, and followed it. Usually the difference between the first and the second best solution in data mining competitions may be 1-2%; 3-5% is quite significant. In the latest ICDM 2009 competition the difference between the first and the second result was more than 50%!

**So finding a solution in data mining is like choosing and following the right path?**

Yes. For example, in the latest ACM KDD Cup competition – prediction of students’ performance in the area of mathematics – I wasn’t so lucky. The data were complex and required the right preprocessing. I made some mistakes at that stage and my results were far from good. During the last 24 hours of the competition I found some of these mistakes and made significant progress, but the time was over. I needed about 1-2 weeks more, but there was no deadline extension. The KDD Cup is an annual event, so we’ll see, maybe next time I’ll have more luck. On the other hand, there’s a direct correspondence between “luck” and experience.

**And what problems did you face in RSCTC 2010 Challenge?**

Bioinformatics is directly relevant to our (Statistics Group at the UQ) research interests; it’s our major research direction. Just before this challenge I published several papers on classification of high-dimensional data and had three publications in the area of bioinformatics, so I was quite well prepared. It was a very interesting and important challenge.

**Why did you decide to take part in this competition? Only because you were well prepared?**

Yes, and, on the other hand, in order to be prepared I have to participate. Right now I am participating in some other competitions. I am sitting here, but I am participating. My computers are working and my PhD students are working as well.

**And is there any difference between data in various competitions? What differs biological and, say, psychological (student behavior) data?**

Of course, there are some differences, but they are not significant. Data mining is like mathematics: it is a general scientific area and the methods of data mining are general methods.

**What do you think about competitions? Are they a good way for scientists to carry out research? Are they a good way for companies to “crowdsource” efficient algorithms?**

Yes. Data mining competitions represent a rapidly growing and very important part of computational statistics. Practically every large company in the world has a data mining department, responsible for data analysis and modeling. These activities may be expensive, but they’re unavoidable. And participation in data mining competitions may be useful for a wide range of researchers including academics, consultants and students, because practical experience is the best way to learn. Moreover, competitions represent a very important form of online (convincing) evaluation of different models and algorithms.

**What books and journals would you recommend for PhD students or others interested in data mining?**

The most famous book is Friedman’s *Elements of Statistical Learning* [T. Hastie, R. Tibshirani, J. Friedman, *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2009]. It may be regarded as a handbook. Jerome Friedman and his co-authors are well known and respected scientists from Stanford University. The book contains all the guidance necessary to develop initial skills in this particular area. And then… Google! With Google you’ll find whatever you like. There are plenty of papers on the Internet, available for free.

**And what about software? Tools, frameworks, libraries.**

There are very promising packages in R, like GBM. There is also ADA, although in most cases ADA is not as good as GBM. Also, I would recommend Random Forest… there are many methods. But in many cases you can’t rely on standard software. For example, in the previous ICDM competition, where the task was very specific and complicated, I didn’t use any standard software at all, only my own tools, written in C or MATLAB. Also, I can recommend Perl, which may be very handy for working with textual information.

**Do you enjoy Poland and the RSCTC conference?**

Poland is a very good and friendly country, and the RSCTC conference is very well organized, no problems at all. I’d like to congratulate the organisers for such an impressive event.

**Do you find any topics particularly interesting?**

Today there was a very interesting session. Andrzej Janusz, the winner of Australian AusDM 2009 Analytic Challenge, made a presentation regarding his participation. I was a runner-up in that event, so it was very interesting for me. Also the session on classification of imbalanced data was interesting.

**You won the Basic Track of RSCTC challenge. What do you think about the winning solution of Advanced Track?**

We produced very impressive results (according to the Leaderboard) with the multilayer perceptron, but we relied on this model too much. Probably, it would be better to make an alternative submission with Support Vector Machines. That’s what the Chinese team, winners of the Advanced Track, did. In most cases it’s a good idea to consider some alternatives.

**Apart from the competition, do you work with bioinformatics data?**

Professor Geoff McLachlan with whom I cooperate has a joint academic appointment at University of Queensland and at the Institute for Molecular Bioscience. He works with real-world bioinformatics data and I, also, participate in this research work. We may expect that our internationally recognized results will help us increase, or at least maintain, our current funding opportunities from the Australian Research Council.

**Regarding research and competitions, is it easier to work on your own or in a team?**

It depends on what sort of team. A team means diversity of skills, and this is the key element of teamwork. If everybody had the same sort of skills as me, what sense would it make to work in such a team? I worked for the Advanced Track in a team with Tian-Hsiang Huang, a PhD student from Taiwan, and it was great. What’s difficult for me was easy for Tian-Hsiang. He’s very skillful in Java programming. Also, he developed my website. In some cases, he doesn’t understand very well the principles of statistics and mathematics, but this is easy for me. A team may be great.

**Do you intend to take part in the latest competition organized by TunedIT – the ICDM Contest related to road traffic prediction?**

Yes, we’re interested. We’ll consider this task very seriously. However, time series analysis is not my major strength, and there are also some problems with free time. But we’ll do our best.

**You’ve mentioned Russia. Where exactly do you come from?**

Kirov is my native city, 600 km to the east of Moscow. It is a nice regional city, and there is still a good environment. Frankly, it’s too exhausting for me to live in such a huge city as Moscow.

**And do you like the place you live in, Australia?**

I have some professional and financial interests there, but without such interests I wouldn’t stay in Australia.

**So where?**

I have a workplace in Kirov at the Vyatka State University. I have accommodation there and colleagues who consider me a staff member. In Russia, it’s a very interesting time now. Russia is dynamic and vibrant. Every year I spend holidays there.

**You’re also taking part in many conferences around the world. Which place among those you’ve visited is your favorite one?**

Washington is very interesting, but very expensive too. There are plenty of museums near the administrative centre and the White House. For example, the Museums of Air and Space and of Natural History were very impressive. In Genoa, Italy, where I was in October last year, I enjoyed the mild Mediterranean climate. In Hong Kong, quite the contrary, the climate was terrible. I was there at about this time of year, the end of June. I like the weather here in Poland; it’s warm and sunny. Now, I’m looking forward to visiting Barcelona in about a week’s time. I’ll have a presentation there, at the WCCI 2010, about my solution for the ICDM 2009 Contest (you may read the paper).

**What do you do in your free time?**

I like cycling very much. There are good facilities for cycling in Brisbane. Some sport activities are necessary, to do maintenance for yourself. Also, I like reading, but, frankly, have less and less time to do that.

**Thank you for the interview.**

*Details of Vladimir’s algorithm can be found in the report.*