Winner’s notes. Yuchun Tang on noise deduction to improve classification accuracy in SIAM SDM’11 Contest

By Yuchun Tang (piaopiao), the runner-up in SIAM SDM’11 Contest.

The QSAR data provided for the SIAM SDM’11 Contest were known to be highly noisy. Around 30% of the labels could be wrong due to experimental uncertainty, as reported by the organizers after the contest closed. Furthermore, the contest counted only the last submission, so it was risky to over-tune the models on the known data (the training data and the preliminary test data).

In my approach, I initially adopted a 7-fold cross-validation (CV) strategy for modeling on the training data. I tried several classification algorithms and observed the best CV results (in terms of Balanced Youden Index) with the R gbm and randomForest packages. At that point the performance for gbm was 0.659/0.664/0.640 (in the order 7-fold CV/preliminary/final), and for randomForest (rf) it was 0.636/0.718/0.628. (Of course, I only learned the final performance after the contest closed.) I also tried different feature selection methods but saw no obvious improvement, so I decided to use all 242 features.
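For readers unfamiliar with the metric, here is a minimal Python sketch of the standard Youden index, J = sensitivity + specificity - 1. It is an illustration only: the contest's "Balanced" variant is not defined in this post and may differ, the function name `youden_index` is hypothetical, and the modeling described here was done in R, not Python.

```python
def youden_index(y_true, y_pred):
    """Standard Youden's J for binary 0/1 labels and 0/1 predictions.

    Assumption: this is the textbook index, not necessarily the exact
    "Balanced Youden Index" scored by the contest.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0


# Toy data with roughly the 4:1 negative-to-positive imbalance of the contest.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
j = youden_index(y_true, y_pred)
```

Because both rates are computed per class, the index does not reward a model that simply predicts the majority class, which matters when negatives dominate.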

The next step was to remove noisy data. The assumption was that an instance is likely to be noisy if it is wrongly predicted with high confidence. I applied this idea to the CV predictions of a balanced gbm model. If the prediction value for a positive instance was less than nplimit, the instance was assumed to be noise. Likewise, a negative instance was treated as noise if its prediction value was greater than or equal to pplimit. These noisy instances were removed, and only the remaining instances were used to train the second-round gbm and randomForest classifiers. After a few rounds of tuning, nplimit was set to 0.2 and pplimit to 0.8. This gave a performance of 0.688/0.644 (in the order preliminary/final) for gbm, and 0.771/0.671 for rf.
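The filtering rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in the contest (which was written in R): the function names `is_noisy` and `filter_noise` are hypothetical, labels are assumed to be coded 1 for positive and 0 for negative, and `cv_preds` stands for the out-of-fold prediction values from the first-round gbm model.

```python
NPLIMIT = 0.2  # positives predicted below this are treated as noise
PPLIMIT = 0.8  # negatives predicted at or above this are treated as noise


def is_noisy(label, cv_pred, nplimit=NPLIMIT, pplimit=PPLIMIT):
    """Return True if the instance is confidently mispredicted."""
    if label == 1:
        return cv_pred < nplimit
    return cv_pred >= pplimit


def filter_noise(labels, cv_preds, nplimit=NPLIMIT, pplimit=PPLIMIT):
    """Return the indices of instances kept for second-round training."""
    return [
        i
        for i, (y, p) in enumerate(zip(labels, cv_preds))
        if not is_noisy(y, p, nplimit, pplimit)
    ]


# Toy example: out-of-fold predictions for five instances.
labels = [1, 1, 0, 0, 0]
cv_preds = [0.05, 0.60, 0.80, 0.30, 0.79]
kept = filter_noise(labels, cv_preds)
```

In the toy data, the first instance (a positive scored 0.05) and the third (a negative scored exactly 0.80) are dropped. Using out-of-fold predictions matters here: scoring an instance with a model that was trained on it would hide most label errors.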

Finally, the above process was applied to the combined training/preliminary data, with all modeling parameters unchanged from the first phase. Step 1: a balanced gbm model was built. Step 2: noisy instances were removed based on the step 1 CV result, with nplimit=0.2 and pplimit=0.8. Step 3: an rf model was built for final classification. Since different rf runs give slightly different results, I actually built 9 rf models and took the majority vote as the final prediction, which ranked 2nd in this contest with a Balanced Youden Index of 0.6889.
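The final voting step can be illustrated with a short Python sketch. It assumes each model has already produced a 0/1 class label per instance (that is, its own cut point has been applied); the function name `majority_vote` and the toy votes are hypothetical, and the contest models themselves were built in R.

```python
def majority_vote(predictions_per_model):
    """Combine 0/1 predictions from several models by majority vote.

    predictions_per_model: list of equal-length lists, one per model.
    With an odd number of models (9 in the post) ties cannot occur.
    """
    n_models = len(predictions_per_model)
    n_instances = len(predictions_per_model[0])
    final = []
    for i in range(n_instances):
        votes = sum(model[i] for model in predictions_per_model)
        final.append(1 if 2 * votes > n_models else 0)
    return final


# Toy example: 9 models voting on 3 instances.
models = [[1, 0, 1]] * 5 + [[0, 0, 1]] * 4
final = majority_vote(models)
```

The first instance gets 5 of 9 positive votes and is labeled positive; the second gets none; the third is unanimous.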

There are two interesting observations. #1: the cut points for the final 9 rf models were between 0.207 and 0.213, which corresponds to the fact that “negatives outnumber positives approximately 4:1”. #2: the row numbers in the training/preliminary data of the 40 noisy instances in my final modeling were 11, 22, 45, 49, 74, 83, 104, 133, 196, 199, 268, 291, 322, 367, 368, 375, 378, 405, 412, 413, 415, 419, 454, 484, 515, 523, 535, 554, 606, 640, 675, 676, 680, 699, 729, 776, 791, 824, 825, 831. This might be useful for Simulations Plus to improve the experiments and labeling.
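The arithmetic behind the first observation is short enough to spell out. A 4:1 negative-to-positive ratio means positives make up about 1/(1+4) = 0.2 of the data, which is close to the observed cut points. The helper name `positive_prior` below is hypothetical, and the ratio is the approximate one quoted above.

```python
def positive_prior(neg_to_pos_ratio):
    """Fraction of positives implied by a negatives:positives ratio."""
    return 1.0 / (1.0 + neg_to_pos_ratio)


prior = positive_prior(4.0)        # 4:1 ratio quoted in the post
cut_low, cut_high = 0.207, 0.213   # cut points reported for the 9 rf models

# Largest gap between the implied prior and the reported cut points.
max_gap = max(abs(cut_low - prior), abs(cut_high - prior))
```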

Thanks to the organizers for the opportunity to play with this interesting data.

Yuchun Tang, Ph.D.
Lead Principal Engineer
Global Threat Intelligence
McAfee Labs

http://www.linkedin.com/in/yuchuntang
http://www.mcafee.com/us/mcafee-labs/threat-intelligence.aspx
http://www.trustedsource.org
