Winners’ notes. Brian Jones on Incremental Transductive Ridge Regression
April 6, 2011
I became interested in the ISMIS 2011 genres contest due to the challenge that some contestants noted in the online forum: standard model selection via cross-validation did not work well on the problem. Supervised learning techniques I tried, such as SVM, FDA, and Random Forest, all achieved accuracy in the 90-95% range in k-fold CV, only to result in leaderboard test set accuracy in the 70-76% range.
I interpreted this performance drop as an indication that sample selection bias, and the resulting dataset shift, were significant. I tried three categories of techniques in an attempt to produce a classifier that adapted to the test set distribution: standard transductive algorithms, importance weighting, and pseudo-labeling methods.
My best entry used what I call Incremental Transductive Ridge Regression. The procedure pseudo-labels test points progressively over multiple iterations in an attempt to gradually adapt the classifier to the test distribution. Labeled points can also be removed or reweighted over time to increase the significance of the unlabeled points. The objective function minimized in each iteration is the combination of a labeled loss term, a pseudo-labeled loss term, and the standard L2 ridge regularizer:

J(w) = Σ_{i ∈ Lt} ||f(xi) − yi||² + C Σ_{i ∈ Ut} ||f(xi) − yi*||² + λ ||w||²
The response vector yi for each point contains K entries, one for each genre, encoded in 1-of-K binary format: yik = 1 if point i has label k and 0 otherwise. Other coding schemes are possible, for example error-correcting output codes or (K−1) orthogonal vectors. The variable yi* is a pseudo-label vector for each unlabeled point, and Lt and Ut denote the sets of labeled and unlabeled point indices used in iteration t. The function f is a linear predictor with weights w, and predictions are produced by taking the argmax over the K entries of f(x).
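With a linear predictor f(x) = Wᵀx and one-hot target matrices, each iteration's weighted least-squares problem has the usual closed-form ridge solution. The sketch below is my own minimal illustration of that solve (function names are mine, not from the contest code):

```python
import numpy as np

def transductive_ridge(X_L, Y_L, X_U, Y_U, C=1.0, lam=1.0):
    """Closed-form minimizer of
        sum_L ||x_i W - y_i||^2 + C * sum_U ||x_i W - y_i*||^2 + lam ||W||^2
    where Y_L / Y_U are one-hot (n x K) label / pseudo-label matrices."""
    d = X_L.shape[1]
    A = X_L.T @ X_L + C * (X_U.T @ X_U) + lam * np.eye(d)
    b = X_L.T @ Y_L + C * (X_U.T @ Y_U)
    return np.linalg.solve(A, b)      # d x K weight matrix

def predict(W, X):
    # argmax over the K per-class scores
    return np.argmax(X @ W, axis=1)
```

Weighting the pseudo-labeled squared-loss term by C is equivalent to scaling those rows by √C, which is why the term folds cleanly into the normal equations.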
I experimented with several techniques for growing an initially empty Ut across T iterations. The most successful approach was a stratified one: in each round, add the Fk / T most confident predictions for every class, where confidence is measured by the multiclass margin and Fk is the expected frequency of class k based on the labeled class distribution. I kept all labeled points in Lt during the T iterations, but surprisingly found that performance improved by removing them all at the end and self-training for a few extra iterations (TII) using just the test points.
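The stratified growing schedule might be sketched as follows. All names here are my own illustration, and details such as tie-breaking, labeled-point reweighting, and the final TII self-training phase are omitted:

```python
import numpy as np

def fit_ridge(X, Y, lam):
    """Multi-output ridge: W = (X'X + lam I)^-1 X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def incremental_transduction(X_L, Y_L, X_U, T=10, C=1.0, lam=1.0):
    """Over T rounds, promote roughly F_k / T of the most confident
    still-unlabeled test points per class into the pseudo-labeled set,
    then refit on the labeled + C-weighted pseudo-labeled points.
    Confidence is the multiclass margin (top score minus runner-up)."""
    n_U, K = X_U.shape[0], Y_L.shape[1]
    # expected test-set class counts from the labeled class frequencies
    F = Y_L.sum(axis=0) / len(Y_L) * n_U
    per_round = np.maximum(1, np.floor(F / T)).astype(int)
    in_U = np.zeros(n_U, dtype=bool)      # points currently in U_t
    Y_star = np.zeros((n_U, K))
    W = fit_ridge(X_L, Y_L, lam)
    for t in range(T):
        scores = X_U @ W
        top2 = np.sort(scores, axis=1)[:, -2:]
        margin = top2[:, 1] - top2[:, 0]  # multiclass margin
        pred = scores.argmax(axis=1)
        for k in range(K):
            cand = np.where((pred == k) & ~in_U)[0]
            best = cand[np.argsort(-margin[cand])][:per_round[k]]
            in_U[best] = True
            Y_star[best] = 0.0
            Y_star[best, k] = 1.0
        # scaling rows by sqrt(C) realizes the C-weighted squared loss
        Xw = np.vstack([X_L, np.sqrt(C) * X_U[in_U]])
        Yw = np.vstack([Y_L, np.sqrt(C) * Y_star[in_U]])
        W = fit_ridge(Xw, Yw, lam)
    return W
```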
In the end, I was able to achieve 82.5% leaderboard accuracy using T=10, TII=5, C=1, λ=1. I added another 0.5% by combining several of these classifiers in a voting ensemble, where diversity was introduced by bootstrap sampling the labeled set. This increase may have been spurious, however, as it did not provide similar gains at larger ensemble sizes.
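A generic sketch of that bootstrap voting ensemble is below; `fit_predict` stands in for one full transductive training run and is a hypothetical interface, not the author's code:

```python
import numpy as np

def bootstrap_vote(fit_predict, X_L, y_L, X_U, n_models=5, seed=0):
    """Majority vote over classifiers trained on bootstrap resamples of
    the labeled set; fit_predict(X_L, y_L, X_U) -> test predictions."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y_L), len(y_L))  # bootstrap sample
        votes.append(fit_predict(X_L[idx], y_L[idx], X_U))
    votes = np.stack(votes)                        # n_models x n_test
    # per-test-point majority vote
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```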
Along the way, I also experimented with semi-supervised manifold algorithms like LapRLS [1] and tried importance weighting using uLSIF [2], but found only modest gains. Other pseudo-labeling techniques that produced around 80% accuracy for me were Large Scale Manifold Transduction [3] and Tri-training [4].
For implementation, I programmed in Python/SciPy and utilized the ‘scikits.learn’ package when experimenting with off-the-shelf classifiers. Reported results involve two pre-processing steps: duplicate entries in the data sets were removed and features were normalized to have zero mean and unit variance.
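The two pre-processing steps could be sketched as follows; this is a minimal illustration of the described pipeline, not the contest code (note that `np.unique` also sorts the rows):

```python
import numpy as np

def preprocess(X):
    """Drop duplicate rows, then standardize each feature to
    zero mean and unit variance."""
    X = np.unique(X, axis=0)           # removes duplicate entries
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd[sd == 0] = 1.0                  # guard against constant features
    return (X - mu) / sd
```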
I would like to thank TunedIT, the members of Gdansk University of Technology, and any others who helped put together this challenging and fun event.
— Brian S. Jones
1. Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research.
2. Kanamori, T., Hido, S., & Sugiyama, M. (2009). A Least-squares Approach to Direct Importance Estimation. Journal of Machine Learning Research.
3. Karlen, M., Weston, J., Erkan, A., & Collobert, R. (2008). Large Scale Manifold Transduction. Proceedings of the International Conference on Machine Learning.
4. Zhou, Z.-H., & Li, M. (2005). Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering.