Winners’ notes. Robert Wieczorkowski, Yasser Tabandeh and Harris Papadopoulos on SIAM SDM’11 Contest
February 11, 2011
We are pleased to publish three post-challenge reports written by participants of the SIAM SDM’11 Data Mining Contest who achieved, ex aequo, the 4th best score (their results differed only in time of submission). We hope you’ll find the reports insightful. To view the full results of the competition, see the Leaderboard and Summary pages.
* * *
By Robert Wieczorkowski, Ph.D. (rwieczor).
The SIAM SDM’11 Contest was the second challenge in which I participated on TunedIT. Previously I took part in the IEEE ICDM Contest (Traffic Prediction) and finished in 12th place. For me, taking part in this challenge was a way to gain new practical experience with modern statistical tools.
I graduated from Warsaw University in Mathematics (1988) and completed a Ph.D. in Mathematics (1995) at the Institute of Mathematics, Warsaw University of Technology. For many years my area of expertise has been statistics and the application of sampling methods in official statistical surveys. I had no specific knowledge of the QSAR domain, so I simply decided to use available methods and algorithms implemented in R packages.
First I selected important attributes by running the Boruta package (which relies on a random forest classifier) on the training data; this left 75 variables out of 242. I then tested various classifiers available in R packages on the 75 selected features, e.g. randomForest, ada, gam, kknn. The most promising results came from randomForest, so from then on I mainly used this classifier.
The evaluation measure used in this challenge (based on specificity and sensitivity) raises the problem of choosing a proper cutoff point for the probability of belonging to a given class. The roc function from the caret package enabled me to approximate cutoff points, and graphs based on this function were also helpful. From simulations based on random samples from the training data I obtained an estimated cutoff of around 0.21.
For a fixed cutoff value I also tried to optimize the randomForest parameters, i.e. ntree, mtry and classwt. For given parameters I also generated many models trained on samples from the training data and combined them with a majority voting procedure.
After the preliminary test set was revealed, I applied my model based on averaging 100 random forest classifiers trained on independent samples from a mixture of training data (70%) and the preliminary test set (20%); it gave accuracy 0.6667 and 4th place on the leaderboard. The elementary classifier in my final model has the following parameters:
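The cutoff search described above can be illustrated with a minimal Python sketch. It is not the author's R code: it assumes the contest measure behaves like Youden's J (sensitivity + specificity − 1, the index named in the second report below), and the labels and scores are made up for illustration.

```python
def youden_j(labels, scores, cutoff):
    """Sensitivity + specificity - 1, predicting positive when score >= cutoff.

    Assumes both classes are present in `labels`.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def best_cutoff(labels, scores):
    """Try every distinct score as a cutoff and keep the one with the highest J."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda c: youden_j(labels, scores, c))


# Made-up class probabilities for an imbalanced problem (not contest data).
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.22, 0.25, 0.40, 0.60]

cutoff = best_cutoff(labels, scores)
print(cutoff, youden_j(labels, scores, cutoff))
```

With imbalanced classes the best cutoff tends to sit well below 0.5, which is consistent with the value of about 0.21 reported above. In practice the search would be repeated over resampled data rather than run once.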
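As a rough illustration of the voting step, the sketch below thresholds each model's predicted probability at a fixed cutoff and takes the majority. The function name and the probabilities are hypothetical; the report does not say how ties or individual votes were handled.

```python
def majority_vote(prob_by_model, cutoff):
    """Combine models by majority vote.

    prob_by_model[m][i] is model m's positive-class probability for example i.
    Each model votes positive when its probability reaches the cutoff; an
    example is labelled positive when more than half of the models vote so.
    """
    n_models = len(prob_by_model)
    n_examples = len(prob_by_model[0])
    predictions = []
    for i in range(n_examples):
        votes = sum(1 for m in range(n_models) if prob_by_model[m][i] >= cutoff)
        predictions.append(1 if 2 * votes > n_models else 0)
    return predictions


# Three hypothetical models scoring four examples (made-up numbers).
probabilities = [
    [0.10, 0.30, 0.22, 0.19],
    [0.25, 0.35, 0.20, 0.15],
    [0.05, 0.40, 0.23, 0.30],
]
predicted = majority_vote(probabilities, cutoff=0.21)
print(predicted)
```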
The challenge was a very interesting experience for me and I would like to thank the organizers and congratulate the winners.
– Robert Wieczorkowski
* * *
By Yasser Tabandeh (TmY).
I’m a master’s student of software engineering at Shiraz University, Iran. My experience with data mining began 2 years ago. I participated in the PAKDD 2010 and UCSD 2010 data mining competitions, and this is the 3rd time I have finished in 5th position in a data mining contest.
I didn’t have any prior knowledge of QSAR and used only data mining techniques for this challenge. In this contest, the number of instances was small in comparison to the number of features. So I didn’t make any submissions in the preliminary phase and waited for the labels of the preliminary instances to be revealed, in order to make my model stronger and prevent overtraining.
I used Weka for my experiments, with 5-fold cross-validation. The performance measure I used to evaluate my models was AUC, because I think it is almost directly related to the Youden Index. In my experiments, I found the best results with 3 different models, each with its own feature set:
- Logistic regression (ridge parameter: 1.0E-8)
- SMO (puk as kernel function, standardize data, build logistic models)
- KNN (K=5)
In the first two models, I used SVM-RFE to rank the features and then backward elimination to select the final feature set. In the third model (KNN) I used my own Relief-based algorithm to weight the features and, again, backward elimination to select the final feature set.
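One common form of backward elimination over a ranked feature list can be sketched as follows. This is only an assumption about the general shape of the procedure; the author's exact stopping rule is not given, and the `evaluate` function here is a toy stand-in for a cross-validated score such as AUC.

```python
def backward_elimination(ranked_features, evaluate):
    """Drop features from least to most important while the score does not fall.

    ranked_features: feature names ordered from most to least important.
    evaluate: callable taking a list of features and returning a score
              (higher is better), e.g. a cross-validated AUC.
    """
    current = list(ranked_features)
    best_score = evaluate(current)
    for feature in reversed(ranked_features):
        if len(current) == 1:
            break  # always keep at least one feature
        trial = [f for f in current if f != feature]
        score = evaluate(trial)
        if score >= best_score:
            current, best_score = trial, score
    return current, best_score


# Toy scorer (hypothetical): features "a" and "c" carry signal, and every
# extra feature costs a small penalty, mimicking overfitting on noise.
USEFUL = {"a", "c"}


def toy_score(features):
    return len(USEFUL.intersection(features)) - 0.01 * len(features)


selected, score = backward_elimination(["a", "c", "b", "d"], toy_score)
print(selected, score)
```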
After creating the 3 models, I combined their scores to generate the final scores by averaging the probabilities of the 3 models. The best cut-off point on the training data was 0.5619, and I selected this point for the final test set. After the final test labels were released, I found that the best cut-off point was lower than this; however, choosing a lower point would have been very risky.
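The averaging step can be sketched in a few lines of Python. The cut-off 0.5619 is the value reported above; the per-model probabilities are made up, and whether the comparison was "greater than" or "greater than or equal" is an assumption.

```python
CUTOFF = 0.5619  # cut-off reported in the text


def average_and_threshold(prob_by_model, cutoff):
    """Average positive-class probabilities across models, then apply a cutoff."""
    n_models = len(prob_by_model)
    averaged = [sum(column) / n_models for column in zip(*prob_by_model)]
    predicted = [1 if p >= cutoff else 0 for p in averaged]
    return averaged, predicted


# Made-up probabilities from the three model types for three test instances.
logistic = [0.70, 0.40, 0.55]
smo = [0.60, 0.50, 0.58]
knn = [0.80, 0.20, 0.50]

averaged, predicted = average_and_threshold([logistic, smo, knn], CUTOFF)
print(averaged, predicted)
```

Averaging probabilities lets a confident model outweigh two lukewarm ones, which thresholding each model separately would not allow.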
To summarize, I think this contest had 3 main challenges:
- Creating a good model without overfitting
- Feature selection
- Selecting cut-off point
Thanks to all the organizers for this contest.
– Yasser Tabandeh
* * *
By Harris Papadopoulos, Ph.D. (hnp).
My solution was based on a recently developed framework called Conformal Prediction (CP), which extends classical machine learning techniques so that they provide valid measures of confidence with their predictions. Confidence measures were not the goal of this contest, but a variation of CP called Mondrian Conformal Prediction can produce confidence measures that are valid for each class of a problem separately, which partly addresses the class imbalance problem of the contest data. Another reason was to explore the possibility of further work on applying CP to this particular application area, where it can be used to reject the cases for which a positive result is unlikely at some confidence level. Starting with a high confidence level and then gradually reducing it can lead to a form of ranking for the order in which to experimentally evaluate candidate compounds. Anyone interested can find a number of Conformal Predictors described in the literature, extending some of the most popular classification techniques, such as Support Vector Machines, Neural Networks, k‑Nearest Neighbours and Genetic Algorithms. The same goes for regression techniques, where CP has been coupled with Ridge Regression, k‑Nearest Neighbours and Neural Networks.
The approach followed included four preprocessing steps. First, outliers were removed using a method developed from the main ideas of the k‑Nearest Neighbours CP. This method evaluates how well each example fits in with all the other examples in the dataset and removes those that do not seem to belong to the dataset as well as the rest do. The second preprocessing step aimed at reducing the class imbalance of the dataset. This was done by removing the examples of the majority class that were very different from those of the minority class. The third step again addressed the class imbalance problem by using the Synthetic Minority Over-sampling Technique (SMOTE) to generate more examples of the minority class. Finally, feature selection was performed by applying the ReliefF technique. The first two steps were implemented in Matlab, while for the last two the WEKA data mining toolkit was used.
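To make the class-wise validity idea concrete, here is a minimal Python sketch of a label-conditional (Mondrian) p-value. It uses the inductive variant with precomputed nonconformity scores, which may differ from the author's Matlab implementation; the scores and labels are made up, and higher scores are assumed to mean "stranger".

```python
def mondrian_p_value(calibration_scores, calibration_labels,
                     test_score, candidate_label):
    """p-value of a candidate label, compared only against same-class examples.

    Restricting the comparison to one class is what makes the resulting
    confidence measure valid for each class separately.
    """
    same_class = [a for a, y in zip(calibration_scores, calibration_labels)
                  if y == candidate_label]
    at_least_as_strange = sum(1 for a in same_class if a >= test_score)
    return (at_least_as_strange + 1) / (len(same_class) + 1)


def prediction_set(p_values, confidence):
    """Keep every label whose p-value exceeds the significance level."""
    significance = 1.0 - confidence
    return {label for label, p in p_values.items() if p > significance}


# Made-up nonconformity scores: four inactive (0) and two active (1) examples.
scores = [0.1, 0.4, 0.3, 0.9, 0.8, 0.2]
labels = [0, 0, 0, 1, 1, 0]

# Hypothetical nonconformity of one test compound under each candidate label.
p_inactive = mondrian_p_value(scores, labels, test_score=0.35, candidate_label=0)
p_active = mondrian_p_value(scores, labels, test_score=0.85, candidate_label=1)
p_values = {0: p_inactive, 1: p_active}

print(p_values)
print(prediction_set(p_values, confidence=0.95))
print(prediction_set(p_values, confidence=0.50))
```

Lowering the confidence level shrinks the prediction sets, so the level at which the positive label first drops out of a compound's set gives one possible ranking of the kind described above.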
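The SMOTE step can be sketched in plain Python as below. This is a simplified illustration of the technique, not WEKA's implementation: the function name, parameters and data points are hypothetical, and each synthetic example is an interpolation between a minority example and one of its nearest minority neighbours.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority examples by interpolating between neighbours.

    minority: list of feature vectors (lists of floats) from the minority class.
    k: number of nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority examples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Four made-up minority examples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
print(new_points)
```

Because every new point lies on a segment between two existing minority examples, the synthetic data stays inside the region the minority class already occupies.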
Twelve different combinations of outlier elimination thresholds and number of top features selected were used in the preprocessing steps, thus generating twelve different datasets. These were then used to train various Mondrian CPs, based on k‑Nearest Neighbours, Naïve Bayes and Neural Networks; two of the Mondrian CPs were ensembles based on the aforementioned algorithms. The final classifications were then generated based on the p-values produced by all dataset-CP combinations for each class, following a voting scheme. All Mondrian CPs and the scheme for combining their outputs to produce the final classifications were implemented in Matlab.
– Harris Papadopoulos