# Winners’ notes: Brian Jones on Incremental Transductive Ridge Regression

April 6, 2011

**By Brian S. Jones (wahoo) from Sandia National Laboratories, USA, 3rd in Music Genres track of ISMIS 2011 Contest: Music Information Retrieval.**

I became interested in the ISMIS 2011 genres contest due to the challenge that some contestants noted in the online forum: standard model selection via cross-validation did not work well on the problem. Supervised learning techniques I tried, such as SVM, FDA, and Random Forest, all achieved accuracy in the 90-95% range in k-fold CV, only to result in leaderboard test set accuracy in the 70-76% range.

I interpreted this performance drop as an indication that the sample selection bias and resulting dataset shift was significant. I tried three categories of techniques in an attempt to produce a classifier that adapted to the test set distribution: standard transductive algorithms, importance weighting, and pseudo-labeling methods.

My best entry used what I call *Incremental Transductive Ridge Regression*. The procedure pseudo-labels test points progressively over multiple iterations in an attempt to gradually adapt the classifier to the test distribution. Labeled points can also be removed or reweighted over time to increase the significance of the unlabeled points. The objective function minimized in each iteration is the combination of a labeled loss term, a pseudo-labeled loss term, and the standard *L*_{2} ridge regularizer:

$$\min_{w} \; \sum_{i \in L_t} \| f(x_i) - y_i \|^2 \; + \; C \sum_{i \in U_t} \| f(x_i) - y_i^{*} \|^2 \; + \; \lambda \| w \|^2$$

The response vector *y*_{i} for each point contains *K* entries, one for each genre, and is encoded in binary format where *y*_{ik} = 1 if point *i* has label *k* and 0 otherwise. Other coding schemes are possible, for example using error-correcting output codes or (*K* − 1) orthogonal vectors. The variable *y**_{i} is a pseudo-label vector for each unlabeled point, and *L*_{t} and *U*_{t} represent the sets of labeled and unlabeled point indices utilized in iteration *t*. The function *f* is a linear predictor with weights *w*, and predictions are produced by argmax_{k} *f*_{k}(*x*).
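In the linear case this combined objective has a closed-form solution. Here is a minimal sketch of the solve, assuming a plain linear model without a bias term (function names are illustrative, not from my actual code):

```python
import numpy as np

def ridge_solve(X_l, Y_l, X_u, Y_u, C=1.0, lam=1.0):
    """Minimize ||X_l W - Y_l||^2 + C ||X_u W - Y_u||^2 + lam ||W||^2.

    X_l: (n_l, d) labeled features;  Y_l: (n_l, K) one-hot targets.
    X_u: (n_u, d) pseudo-labeled features;  Y_u: (n_u, K) pseudo-labels.
    """
    d = X_l.shape[1]
    A = X_l.T @ X_l + C * (X_u.T @ X_u) + lam * np.eye(d)
    b = X_l.T @ Y_l + C * (X_u.T @ Y_u)
    return np.linalg.solve(A, b)            # W has shape (d, K)

def predict(W, X):
    # argmax over the per-class scores f_k(x) = x . w_k
    return np.argmax(X @ W, axis=1)
```

Passing empty arrays for `X_u`/`Y_u` recovers plain ridge regression on the labeled set alone.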

I experimented with several techniques for growing an initially empty *U*_{t} across *T* iterations. The most successful approach was a stratified one, adding the most confident *F*_{k}/*T* predictions for every class in each round. Confidence is determined by the multiclass margin, and *F*_{k} is the expected frequency of class *k* based on the labeled class distribution. I kept all labeled points in *L*_{t} during the *T* iterations, but surprisingly found that performance increased by removing them all at the end and self-training for a few extra iterations (*T*_{II}) using just the test points.
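The incremental loop can be sketched as follows. This is a hedged reconstruction: the stratified per-class quotas and margin-based confidence follow the description above, but the quota rounding, helper names, and the omission of the final self-training phase are illustrative choices of this sketch:

```python
import numpy as np

def fit_ridge(X_l, Y_l, X_u, Y_u, C, lam):
    # closed-form solve of the combined labeled + pseudo-labeled objective
    d = X_l.shape[1]
    A = X_l.T @ X_l + C * (X_u.T @ X_u) + lam * np.eye(d)
    return np.linalg.solve(A, X_l.T @ Y_l + C * (X_u.T @ Y_u))

def incremental_transduction(X_l, y_l, X_u, T=10, C=1.0, lam=1.0):
    """Pseudo-label X_u over T rounds, stratified by labeled class frequency."""
    n_u, K = len(X_u), int(y_l.max()) + 1
    Y_l = np.eye(K)[y_l]
    # F_k: expected count of class k among the test points
    F = np.bincount(y_l, minlength=K) / len(y_l) * n_u
    chosen = np.full(n_u, -1)               # -1 = not yet pseudo-labeled
    W = fit_ridge(X_l, Y_l, np.zeros((0, X_l.shape[1])), np.zeros((0, K)), C, lam)
    for t in range(1, T + 1):
        scores = X_u @ W
        order = np.argsort(scores, axis=1)
        pred = order[:, -1]
        # multiclass margin: best score minus runner-up score
        margin = (scores[np.arange(n_u), order[:, -1]]
                  - scores[np.arange(n_u), order[:, -2]])
        for k in range(K):                  # add ~F_k/T points per class per round
            quota = int(round(F[k] * t / T)) - int(np.sum(chosen == k))
            cand = np.where((chosen == -1) & (pred == k))[0]
            take = cand[np.argsort(-margin[cand])][:max(quota, 0)]
            chosen[take] = k
        sel = chosen >= 0
        W = fit_ridge(X_l, Y_l, X_u[sel], np.eye(K)[chosen[sel]], C, lam)
    return np.argmax(X_u @ W, axis=1)
```

The final *T*_{II} self-training phase would repeat the same loop with the labeled terms dropped, refitting on pseudo-labels alone.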

In the end, I was able to achieve 82.5% leaderboard accuracy using *T* = 10, *T*_{II} = 5, *C* = 1, *λ* = 1. I added another 0.5% by combining several of these classifiers in a voting ensemble, where diversity was introduced by bootstrap sampling the labeled set. This increase may have been spurious, however, as it did not provide similar gains at larger ensemble sizes.
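The voting ensemble amounts to a simple bagging scheme; a minimal sketch (the member count and the `train_fn` signature are illustrative assumptions):

```python
import numpy as np

def bootstrap_ensemble(train_fn, X_l, y_l, X_u, n_models=5, seed=0):
    """Majority vote over members trained on bootstrap resamples of the labeled set."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_l), len(X_l))   # sample with replacement
        votes.append(train_fn(X_l[idx], y_l[idx], X_u))
    votes = np.stack(votes)                          # (n_models, n_test)
    # per-test-point majority vote over the member predictions
    return np.array([np.bincount(col).argmax() for col in votes.T])
```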

Along the way, I also experimented with semi-supervised manifold algorithms like LapRLS [1] and tried importance weighting using uLSIF [2], but found only modest gains. Other pseudo-labeling techniques that produced around 80% accuracy for me were Large Scale Manifold Transduction [3] and Tri-training [4].

For implementation, I programmed in Python/SciPy and utilized the ‘scikits.learn’ package when experimenting with off-the-shelf classifiers. Reported results involve two pre-processing steps: duplicate entries in the data sets were removed and features were normalized to have zero mean and unit variance.
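The two pre-processing steps are straightforward in NumPy; a minimal sketch (in practice duplicate removal would also need to keep any labels aligned with the surviving rows):

```python
import numpy as np

def preprocess(X):
    """Drop duplicate rows, then scale to zero mean and unit variance."""
    X = np.unique(X, axis=0)              # removes duplicate entries (also sorts rows)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sd == 0, 1.0, sd)   # guard constant features
```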

I would like to thank TunedIT, the members of Gdansk University of Technology, and any others who helped put together this challenging and fun event.

— Brian S. Jones

### References

1. Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. *Journal of Machine Learning Research*.

2. Kanamori, T., Hido, S., & Sugiyama, M. (2009). A Least-squares Approach to Direct Importance Estimation. *Journal of Machine Learning Research*.

3. Karlen, M., Weston, J., Erkan, A., & Collobert, R. (2008). Large Scale Manifold Transduction. *Proceedings of the International Conference on Machine Learning*.

4. Zhou, Z.-H., & Li, M. (2005). Tri-training: exploiting unlabeled data using three classifiers. *IEEE Transactions on Knowledge and Data Engineering*.
