Winner’s notes. Yuchun Tang on noise deduction to improve classification accuracy in SIAM SDM’11 Contest
February 22, 2011
By Yuchun Tang (piaopiao), the runner-up in SIAM SDM’11 Contest.
The QSAR data provided for the SIAM SDM’11 Contest were known to be highly noisy. According to the organizers' report after the contest closed, around 30% of the provided labels could be wrong due to experimental uncertainty. Furthermore, the contest counted only the last submission, so it was risky to over-tune models on the known data (the training data and the preliminary test data).
In my approach, I initially adopted a 7-fold cross-validation (CV) strategy for modeling on the training data. I tried several classification algorithms and observed the best CV results (in terms of Balanced Youden Index) with the R gbm and randomForest packages. At that point the performance of gbm was 0.659/0.664/0.640 (in the order 7-fold CV/preliminary/final), and that of randomForest (rf) was 0.636/0.718/0.628. (Of course, I only learned the final performance after the contest closed.) I also tried different feature selection methods, but I did not see an obvious improvement, so I decided…
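To make the 7-fold cross-validation setup described above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the code used in the contest: the original models were built with R's gbm and randomForest, whereas this sketch accepts any `fit`/`predict` pair. The helper names (`stratified_folds`, `youden_index`, `cross_validate`) are hypothetical, the use of stratified folds is an assumption, and the score is the plain Youden's J (sensitivity + specificity - 1), used only as a stand-in because the exact definition of the contest's Balanced Youden Index is not given here.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=7, seed=0):
    """Assign sample indices to k folds, keeping class ratios similar (assumed design)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def youden_index(y_true, y_pred):
    """Plain Youden's J for 0/1 labels (a stand-in, not the contest's balanced variant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0


def cross_validate(features, labels, fit, predict, k=7, seed=0):
    """Return one Youden's J score per held-out fold."""
    scores = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        # Train on the other k-1 folds, then score on the held-out fold.
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        preds = predict(model, [features[i] for i in held_out])
        scores.append(youden_index([labels[i] for i in held_out], preds))
    return scores
```

Averaging the returned list gives a single CV figure comparable in spirit to the "7-CV" numbers quoted above. With labels that are about 30% noisy, as reported for this contest, the spread of the per-fold scores is worth inspecting alongside the mean.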