IEEE ICDM Contest – Overview of Top Solutions, part 1
October 26, 2010
The IEEE ICDM Contest: TomTom Traffic Prediction for Intelligent GPS Navigation has come to an end. As promised, we are publishing descriptions of the top solutions, provided by the participants. Although the reports had to be brief, the authors not only revealed a good deal of important detail about their approaches, but also kept the descriptions straightforward and concise, giving all of us an unprecedented opportunity to learn the essence of data mining know-how. This is a good supplement to the fully scientific articles that will be presented during the Contest Workshop at the ICDM conference in Sydney.
If you want to ask the authors any questions, feel free to comment below the post.
* * *
By Alexander Groznetsky (alegro), the winner. Alex is an experienced data miner who participated (under the nick orgela) in the Netflix Prize contest in its early days; this becomes pretty clear when you look at the list of algorithms he used for ICDM – they sound very Netflix-like :). To learn about the task, see the “Traffic” task description page.
My solution was a linear mixture of about 20 predictors, based on three types of algorithms used with different parameters:
- Linear Least Squares (LLS),
- Supervised SVD-like factorization (Singular Value Decomposition),
- Restricted Boltzmann Machine (RBM) neural network.
The first predictor was based on a weighted linear regression model. One set of regression parameters was computed for each target value (the summary congestion at minutes 41-50 of a road segment). Known target values were used as regressands, and congestions averaged per segment were used as regressors. Regression weights (one per design-matrix row) were computed as the product of similarity and the time distance from the target to the regressors' averaging intervals. Only a limited number of the neighbors most similar to the predicted one were used for modeling. This predictor produced several predictions with different averaging intervals, different numbers of selected neighbors, and neighbors aligned or not aligned to the hour boundary.
The similarity measure used in weighting and selection was computed in the following way:
S = P^a / E^b,
where P and E are the Pearson correlation coefficient and the Euclidean distance between the known congestion values, and a and b are constants; in most cases a = 4 and b = 6 were used.
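For illustration, here is a minimal numpy sketch of this similarity measure, together with the standard sqrt-weight row-scaling trick for the weighted least-squares step. `similarity` and `weighted_lls` are hypothetical helper names, and the `eps` guard is an addition for numerical safety:

```python
import numpy as np

def similarity(x, y, a=4.0, b=6.0, eps=1e-12):
    """S = P^a / E^b between two congestion vectors (hypothetical helper)."""
    p = np.corrcoef(x, y)[0, 1]        # Pearson correlation P
    e = np.linalg.norm(x - y)          # Euclidean distance E
    return p ** a / (e ** b + eps)     # eps guards against E = 0

def weighted_lls(X, y, w):
    """Weighted least squares via the sqrt-weight row-scaling trick."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta
```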
The second predictor was based on a low-rank, SVD-like factorization of the congestion values according to the matrix equation C = LR, where C is the matrix of congestion values averaged over 10-minute intervals, weighted, and centered by subtracting row and column means (some values of C are missing), and L, R are the learned left and right factor matrices. The dimensions were 2000×120 for C, 2000×16 for L, and 16×120 for R. L and R were initialized with random values and learned by gradient descent steps. Final predictions were made by solving a weighted linear least squares problem defined by L and the similarity measures. To produce cross-validated (CV) prediction values, L and R were learned once for each CV fold.
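A compact sketch of such a masked low-rank factorization, assuming plain gradient descent on the squared error over observed entries only; the learning rate and epoch count are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)

def factorize(C, mask, rank=16, lr=1e-3, epochs=500):
    """Learn C ~ L @ R by gradient descent on the observed entries only.

    C: (n, m) matrix (e.g. 2000 x 120) with finite placeholders (e.g. 0)
    at missing entries; mask: boolean, True where C is observed.
    """
    n, m = C.shape
    L = rng.normal(scale=0.1, size=(n, rank))
    R = rng.normal(scale=0.1, size=(rank, m))
    for _ in range(epochs):
        err = (L @ R - C) * mask       # residual on known entries only
        gL, gR = err @ R.T, L.T @ err  # gradients of 0.5 * ||err||^2
        L -= lr * gL
        R -= lr * gR
    return L, R
```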
The third predictor was based on a Restricted Boltzmann Machine (RBM) neural network with 1536 hidden units and ~58k visible units. Hidden unit states were modeled with a conditional Bernoulli distribution; softmax units were used as visible units. Each visible unit corresponds to one automobile, and a vector of visible unit values corresponds to one hour of the simulation. The RBM was trained by the contrastive divergence method; units with missing values made no contribution to the energy function of the network. Predictions were made by the mean-field update method.
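The contest model (softmax visible units, one per automobile, with missing values masked out of the energy) is elaborate; as a simplified illustration, here is a single contrastive-divergence (CD-1) update for a plain Bernoulli-Bernoulli RBM in numpy. Binary visible units stand in for the author's softmax units, and all sizes and the learning rate are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, bv, bh, lr=0.01):
    """One CD-1 update for a Bernoulli-Bernoulli RBM.

    v0: (batch, n_visible) binary data; W: (n_visible, n_hidden) weights;
    bv, bh: visible and hidden biases.
    """
    # Positive phase: hidden probabilities and sampled states given the data.
    ph0 = sigmoid(v0 @ W + bh)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visibles and up again.
    pv1 = sigmoid(h0 @ W.T + bv)
    ph1 = sigmoid(pv1 @ W + bh)
    # Update from the difference of data and reconstruction correlations.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    bv += lr * (v0 - pv1).mean(axis=0)
    bh += lr * (ph0 - ph1).mean(axis=0)
    return W, bv, bh
```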
About 20 predictions produced by the described predictors were combined by a linear mixer with non-negative weights. The weights were computed separately for each road segment, based on the predicted CV values and some similarity information. 100-fold CV was used in all predictions except the RBM-based ones (one CV fold per simulation cycle); two-fold CV was used in the RBM-based predictions.
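For illustration, such a non-negative linear mix can be fitted with non-negative least squares on the cross-validated predictions; scipy's `nnls` below is a stand-in, not necessarily the author's actual mixer:

```python
import numpy as np
from scipy.optimize import nnls

def mix_weights(P_cv, y_cv):
    """Non-negative mixing weights from cross-validated predictions.

    P_cv: (n_samples, n_predictors) matrix whose columns are the CV
    predictions of each model; y_cv: the true targets.
    """
    w, _ = nnls(P_cv, y_cv)
    return w

# Blended prediction on new data: P_new @ mix_weights(P_cv, y_cv)
```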
I did not optimize the algorithms for computational efficiency. For example, several training passes (for cross-validation) of the RBM with a 117k × 1.5k weight matrix over 2M sample vectors (2k vectors × 1000 epochs) were performed. Each pass costs more than 2.8 petaFLOPs (117k × 1.5k × 2M × 4 × 2 = 2.808e15) in matrix-matrix multiplications alone. My algorithms, the RBM and the similarity calculation, were implemented and executed on a GPU, so each pass took about 6 hours. On a standard quad-core CPU the same calculations would take 40-60 hours per pass.
* * *
The “Traffic” problem was described and discussed in depth elsewhere. But one main aspect of it should be stressed here: it required estimating a multidimensional (length-20) vector, whereas most data mining methods, whether published in books or implemented in software packages, only predict scalar responses.
After a brief visual inspection of the data, I started by trying to set things straight and save myself precious time at later steps, that is:
- Choose a data analysis tool that would give me the required flexibility, that I could run in the cloud if need be, etc. The R environment suited my needs perfectly.
- Create a good set of examples. I decided to group the data into 10-minute slots, which let me build a database of over 5000 independent samples.
- Build an abstract model that could combine any set of 20 unidimensional-response models of my choice into a single one adapted to my multidimensional problem (see the sketch after this list).
- Build a framework that would automate the selection of training and validation sets, model fitting, and the creation of datasets to be submitted to the contest organizers. After this was achieved, testing a new model became a matter of plug and play!
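The author worked in R; as a language-neutral illustration, here is a Python sketch of such a wrapper that fits one scalar-response model per output dimension (scikit-learn's MultiOutputRegressor implements the same idea):

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

class PerOutputModel:
    """Combine 20 scalar-response models into one multi-output model."""

    def __init__(self, base):
        self.base = base                      # any scalar-response regressor

    def fit(self, X, Y):
        # Y: (n_samples, 20), one column per road segment to predict.
        self.models_ = [clone(self.base).fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models_])

# e.g. PerOutputModel(RandomForestRegressor(n_estimators=100)).fit(X, Y)
```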
Textbook examples of highly parsimonious statistical methods (regressions, etc.) did not seem to provide satisfactory error rates. It quickly became clear that the solution was much more local. The models that behaved better were those that retained a much higher amount of information from the training datasets: Random Forests, k-Nearest Neighbors, and the like.
Besides, I realized that these models did not all mine the same wells of information in the training data. Some would ignore the correlation structure among features, others would only pay attention to very narrow neighborhoods of new cases, etc.
This is why my final model was built as a convex combination (the one that minimizes global error) of the three best models I had obtained so far, which was superior to any individual model. Error convexity means that inferior models can still help improve better ones!
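A convex combination constrains the weights to be non-negative and to sum to one; here is a sketch of fitting such weights on held-out predictions (the helper name and the optimizer choice are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def convex_blend(P, y):
    """Weights w >= 0 with sum(w) = 1 minimizing the RMSE of P @ w vs y.

    P: (n_samples, n_models) held-out predictions of the three models.
    """
    k = P.shape[1]
    rmse = lambda w: np.sqrt(np.mean((P @ w - y) ** 2))
    res = minimize(rmse, np.full(k, 1.0 / k),
                   bounds=[(0.0, 1.0)] * k,
                   constraints=({'type': 'eq',
                                 'fun': lambda w: w.sum() - 1.0},))
    return res.x
```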
* * *
By Benjamin Hamner (hamner), 3rd place. Benjamin recently completed his bachelor's degree at Duke University and is currently doing research at EPFL, Lausanne, Switzerland, on a Whitaker Fellowship.
Preprocessing: Three training data sets were considered:
Set A – 1000 data points, corresponding to the first half-hour of each hour-long window in the training set.
Set B – 11,000 data points, Set A + shifts in the training window in one-minute increments for 10 minutes.
Set C – 55,000 data points, Set B + all additional possible shifts.
Set A had the advantage that it contained no redundant data and was drawn from the same distribution as the test set. Sets B and C augmented the data with additional points, but at the cost of potentially shifting the distribution of the training set away from that of the test set.
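Here is a sketch, under stated assumptions, of how such shifted example sets could be cut from a per-minute series; the minutes 1-30 in / 41-50 out layout follows the task description, while the exact series shape and shift lists are assumptions matching the counts above:

```python
import numpy as np

def windows(series, starts, in_len=30, gap=10, out_len=10):
    """Cut (input, target) pairs from a per-minute series.

    series: (n_minutes, n_segments) counts; each window takes `in_len`
    minutes as features and, after a `gap`, the next `out_len` minutes
    as the target (minutes 1-30 in, 41-50 out, as in the task).
    """
    X = np.stack([series[s : s + in_len] for s in starts])
    Y = np.stack([series[s + in_len + gap : s + in_len + gap + out_len]
                  for s in starts])
    return X, Y

# Assumed shift lists: Set A uses the 1000 hour starts only; Set B adds
# offsets of +1..+10 minutes to each (11,000 windows); Set C uses every
# start that still fits in the series.
hour_starts = np.arange(1000) * 60
set_b_starts = (hour_starts[:, None] + np.arange(11)).ravel()
```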
The mean RMSE of each of these sets on the training data set was evaluated (split 50% training, 50% validation, 100-tree random forests, 15-minute downsampling window). The results were:
Set A – 24.62
Set B – 22.64
Set C – 22.59
The ATR time series was downsampled by summing, for each road segment, the counts of cars in consecutive intervals. The graph below shows how the width of the window affected performance within the training set (split 50% training, 50% validation, 12-tree Random Forests on Set C).
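The downsampling step itself is a straightforward reshape-and-sum; a minimal sketch with assumed shapes (rows are minutes, columns are the 20 road segments):

```python
import numpy as np

def downsample(counts, width):
    """Sum per-minute car counts in consecutive `width`-minute bins.

    counts: (n_minutes, n_segments); trailing minutes that do not fill
    a complete bin are dropped.
    """
    n, s = counts.shape
    return counts[: n - n % width].reshape(-1, width, s).sum(axis=1)

# e.g. a (30, 20) half-hour block with width=15 becomes (2, 20):
# 2 * 20 = 40 features per example for the random forest.
```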
Regression: 100-tree random forests were used for regression on the training data sets.
Ensembles of Random Forests: The most successful ensemble was:
20% – Set B, 15-minute downsampling window (Validation score: 24.829)
40% – Set C, 10-minute downsampling window (Validation score: 24.637)
40% – Set C, 15-minute downsampling window (Validation score: 24.595)
Ensemble – Validation score 24.411, Test score 25.337
* * *
By Vladimir Nikulin (team UniQ), 4th place. Vladimir works at the Department of Mathematics, University of Queensland, Australia.
With the following approach we observed a preliminary (leaderboard) result of 25.8. As the target variable we took the sum of cars for the period between the 41st and 50th minutes. We used 600 features: the numbers of cars between the 1st and 30th minutes for the 20 given segments. We then produced two solutions using the randomForest and GBM packages in R, which were linked together by an ensemble constructor as described in V. Nikulin, “Web-mining with Wilcoxon-based feature selection, ensembling and multiple binary classifiers” (see the web site of the PKDD2010 Discovery Challenge). A further improvement to 24.502 was achieved by introducing three regulation parameters: p1 = 6, used for smoothing the features (moving averages); p2 = 8, the time interval (in minutes) between the known and predicted data; and p3 = 12, the smoothing parameter for the predicted time interval. Accordingly, the number of features was reduced to 500, and as the target we used the sum of cars for the period between the 39th and 50th minutes. Cross-validation with 10 folds was used to optimise the values of all regulation parameters.
Remark: The parameter p2 represents a compromise between the training and test data: making p2 smaller improves the quality of prediction (it moves the target closer to the training data), but we cannot stray too far from the required test time interval. Making p3 bigger than 10 simplifies the task, but p3 cannot be too big either; otherwise, the task will suffer from over-smoothing. It is another compromise.
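For illustration, a sketch of the p1 smoothing step with assumed array shapes; it reproduces the 600-to-500 feature reduction mentioned above:

```python
import numpy as np

def moving_average(counts, p1=6):
    """Length-p1 moving average down each column ('valid' mode)."""
    kernel = np.ones(p1) / p1
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="valid"), 0, counts)

# counts: (30, 20) cars per minute (minutes 1-30) and road segment.
# Smoothing with p1 = 6 leaves 30 - 6 + 1 = 25 values per segment,
# i.e. 25 * 20 = 500 features, matching the description. The target is
# the sum of cars over minutes 39-50 (p3 = 12 minutes long), starting
# p2 = 8 minutes after the last observed minute (minute 30).
```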
* * *
We thank all the authors for their descriptions and congratulate them on achieving top scores!