# IEEE ICDM Contest – Overview of Top Solutions, part 1

October 26, 2010

**The IEEE ICDM Contest: TomTom Traffic Prediction for Intelligent GPS Navigation came to an end. As promised, we are publishing descriptions of the top solutions, provided by the participants. Although the reports had to be brief, the authors not only revealed a good deal of important detail about their approaches, but also kept the descriptions straightforward and concise, giving all of us an unprecedented opportunity to learn the essence of data mining know-how. This is a good supplement to the full scientific articles that will be presented during the Contest Workshop at the ICDM conference in Sydney.**

**Today we publish the descriptions for Task 1, “Traffic”. In the coming days we will make another post with the Task 2 and 3 reports – stay tuned! We thank all the authors for their contributions.**

**If you want to ask the authors any questions, feel free to comment below the post.**

* * *

**By Alexander Groznetsky (alegro), the winner. Alex is an experienced data miner who participated (under the nick *orgela*) in the Netflix Prize contest in its early days – a fact that becomes pretty clear when you look at the list of algorithms he used for ICDM: they sound very Netflix-like :). To learn about the task, see the “Traffic” task description page.**

My solution was a linear mixture of about 20 predictors, based on three types of algorithms used with different parameters:

- Linear Least Squares (LLS),
- Supervised SVD-like factorization (Singular Value Decomposition),
- Restricted Boltzmann Machine (RBM) neural network.

The first predictor was based on a weighted linear regression model. One set of regression parameters was computed for each target value (the summary congestion at minutes 41-50 on a road segment). Known target values were used as regressands, and congestions averaged per segment were used as regressors. Regression weights (one per design-matrix row) were computed as the product of similarity and the time distance from the target to the regressors' averaging intervals. Only a limited number of the neighbors most similar to the predicted one was used for modeling. Several predictions were produced by this predictor with different averaging intervals, different numbers of selected neighbors, and with neighbors aligned or not aligned on hour boundaries.
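
To make the mechanics concrete, here is a minimal weighted least squares sketch in Python. This is not Alex's code: the single regressor, the toy congestion values, and the weights are invented for illustration.

```python
def weighted_lls(x, y, w):
    """Fit y ≈ b0 + b1*x minimizing sum(w_i * (y_i - b0 - b1*x_i)^2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    var = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b1 = cov / var
    b0 = my - b1 * mx
    return b0, b1

# toy example: neighbors' averaged congestion -> known target sums (minutes 41-50)
x = [10.0, 12.0, 15.0, 9.0]     # averaged congestion per neighbor window
y = [21.0, 25.0, 31.0, 19.0]    # known target values (here exactly y = 2x + 1)
w = [1.0, 0.8, 0.5, 1.0]        # similarity * time-distance weights (invented)
b0, b1 = weighted_lls(x, y, w)
pred = b0 + b1 * 11.0           # prediction for a new averaged congestion of 11
```

Because the toy data are exactly linear, the weighted fit recovers the line regardless of the weights; on real data the weights shift the fit toward the most similar, most recent neighbors.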

The similarity measure used for weighting and selection was computed in the following way:

S = P^a / E^b,

where P and E are the Pearson correlation coefficient and the Euclidean distance between known congestion values, and a, b are constants; in most cases a = 4, b = 6 were used.
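
For illustration, the similarity measure could be computed like this (a sketch; the epsilon guard against identical series, which would give a zero distance, is our addition, not part of the original description):

```python
import math

def similarity(u, v, a=4, b=6):
    """S = P^a / E^b: Pearson correlation to the a-th power over
    Euclidean distance to the b-th power."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    p = cov / (su * sv)                                  # Pearson correlation
    e = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    e = max(e, 1e-9)                                     # guard: identical series
    return p ** a / e ** b

s_near = similarity([1, 2, 3], [1.1, 2.1, 3.1])  # correlated and close
s_far = similarity([1, 2, 3], [2, 3, 4])         # correlated but farther away
```

With a = 4 and b = 6 the distance term dominates, so among equally correlated neighbors the closer one gets a much larger weight.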

The second predictor was based on a low-rank, SVD-like factorization of the congestion values according to the matrix equation C = LR, where C is the matrix of congestion values averaged over 10-minute intervals, weighted, and centered by subtracting row and column means (some values of C are missing), and L, R are the learned left and right factor matrices. The dimensions were: C 2000×120, L 2000×16, R 16×120. L and R were initialized with random values and learned by gradient descent steps. Final predictions were made by solving a weighted linear least squares problem defined by L and the similarity measures. To produce cross-validated (CV) prediction values, L and R were learned once for each CV fold.

The third predictor was based on a Restricted Boltzmann Machine (RBM) neural network model, with 1536 hidden units and ~58k visible units. A conditional Bernoulli distribution was used to model the hidden unit states, and softmax units were used as visible units. Each visible unit corresponds to one automobile, and a vector of visible unit values corresponds to one hour of the simulation. The RBM was trained by the contrastive divergence method. Units with missing values made no contribution to the energy function of the network. Predictions were made by the mean-field update method.
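
As a rough illustration of contrastive divergence (not the contest model, which used softmax visibles, 1536 hidden units and GPU code), a tiny Bernoulli-Bernoulli RBM trained with CD-1 can be sketched as:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyRBM:
    """Minimal Bernoulli-Bernoulli RBM trained with one step of
    contrastive divergence (CD-1); purely illustrative."""
    def __init__(self, n_vis, n_hid, seed=0):
        rnd = random.Random(seed)
        self.W = [[rnd.gauss(0, 0.01) for _ in range(n_hid)] for _ in range(n_vis)]
        self.bv = [0.0] * n_vis
        self.bh = [0.0] * n_hid
        self.rnd = rnd

    def hid_probs(self, v):
        return [sigmoid(self.bh[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.bh))]

    def vis_probs(self, h):
        return [sigmoid(self.bv[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.bv))]

    def cd1(self, v0, lr=0.1):
        h0 = self.hid_probs(v0)                     # positive phase
        hs = [1.0 if self.rnd.random() < p else 0.0 for p in h0]
        v1 = self.vis_probs(hs)                     # one-step reconstruction
        h1 = self.hid_probs(v1)                     # negative phase
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
        for i in range(len(v0)):
            self.bv[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.bh[j] += lr * (h0[j] - h1[j])

rbm = TinyRBM(n_vis=6, n_hid=3)
patterns = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]  # invented training data
for _ in range(200):
    for p in patterns:
        rbm.cd1(p)
recon = rbm.vis_probs(rbm.hid_probs(patterns[0]))    # mean-field reconstruction
```

The final line mirrors the prediction step described above: after training, unknown visibles are filled in by mean-field updates rather than by sampling.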

About 20 predictions produced by the described predictors were mixed by a linear mixer with non-negative weights. The weights were computed separately for each road segment, based on the predicted CV values and some similarity information. 100-fold CV was used for all predictions except the RBM-based ones (one CV fold per simulation cycle); two-fold CV was used for the RBM-based predictions.
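
A minimal sketch of fitting non-negative mixing weights by projected gradient descent follows; the predictor outputs and target are invented, and the actual mixer also used similarity information.

```python
def nonneg_mix_weights(preds, target, steps=5000, lr=0.01):
    """Gradient descent on ||sum_k w_k * p_k - y||^2 with w clamped to w >= 0."""
    k, n = len(preds), len(target)
    w = [1.0 / k] * k
    for _ in range(steps):
        blend = [sum(w[j] * preds[j][i] for j in range(k)) for i in range(n)]
        grad = [2.0 / n * sum((blend[i] - target[i]) * preds[j][i] for i in range(n))
                for j in range(k)]
        w = [max(0.0, w[j] - lr * grad[j]) for j in range(k)]  # project onto w >= 0
    return w

p1 = [1.0, 2.0, 3.0]           # CV predictions of model 1 (invented)
p2 = [3.0, 2.0, 1.0]           # CV predictions of model 2 (invented)
y = [2.4, 2.0, 1.6]            # equals 0.3*p1 + 0.7*p2, so the optimum is known
w = nonneg_mix_weights([p1, p2], y)
```

In the contest this fit was repeated per road segment on the cross-validated predictions, giving each segment its own blend of the ~20 base models.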

I did not optimize the algorithms for computational efficiency. For example, several training passes (for cross-validation) of the RBM with a 117k × 1.5k weight matrix over 2M sample vectors (2k vectors × 1000 epochs) were performed. Each pass costs more than 2.8 petaFLOPs (117k × 1.5k × 2M × 4 × 2 = 2.808e15) in matrix-matrix multiplications alone. My algorithms – the RBM and the similarity calculation – were implemented and executed on a GPU, so each pass took about 6 hours. On a standard quad-core CPU the same calculations would take 40-60 hours per pass.

* * *

**By Carlos J. Gil Bellosta (datanalytics), 2nd place. Carlos is the principal and owner of the Spanish data mining company Datanalytics.**

The “Traffic” problem was described and discussed in depth elsewhere, but one main aspect of it should be stressed here: it required estimating a multidimensional vector (of length 20). Most data mining methods, whether published in books or implemented in software packages, only predict scalar responses.

After a brief visual inspection of the data, I started by setting things up properly, trying to save myself precious time at later steps, that is:

- Choose a data analysis tool that would provide me with the required flexibility, one that I could run in the cloud if need be, etc. R suited my needs perfectly.
- Create a good set of examples. I decided to group data into 10-minute slots, which let me build a database of over 5000 independent samples.
- Build something like an abstract model which could combine any set of 20 models that fit unidimensional responses of my choice into a single one adapted to my multidimensional problem.
- Build a framework that would automate the selection of training and validation sets, model fitting, and the creation of datasets to be submitted to the contest organizers. After this was achieved, testing a new model would be a matter of plug and play!
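
The third bullet, combining 20 scalar models into one multidimensional model, can be sketched generically. Carlos worked in R; this is a hypothetical Python illustration, with a trivial mean predictor standing in for any real scalar model.

```python
class MeanModel:
    """Trivial scalar model used as a stand-in for any real regressor."""
    def fit(self, X, y):
        self.mu = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mu for _ in X]

class MultiOutputWrapper:
    """Fit one scalar model per output dimension, then stack their
    predictions into vectors of length n_outputs."""
    def __init__(self, model_factory, n_outputs):
        self.models = [model_factory() for _ in range(n_outputs)]
    def fit(self, X, Y):
        for j, m in enumerate(self.models):
            m.fit(X, [row[j] for row in Y])   # j-th column of the targets
        return self
    def predict(self, X):
        cols = [m.predict(X) for m in self.models]
        return [list(t) for t in zip(*cols)]  # re-assemble rows

# toy usage: 2 outputs instead of the contest's 20
wrapper = MultiOutputWrapper(MeanModel, n_outputs=2).fit([[0], [1]], [[1, 2], [3, 4]])
preds = wrapper.predict([[5], [6]])
```

Swapping `MeanModel` for a random forest or k-NN is then exactly the "plug and play" described above.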

Textbook examples of highly parsimonious statistical methods (regressions, etc.) did not seem to provide satisfactory error rates. It quickly became clear that the solution was much more local. The models that behaved best were those which retained a much larger amount of information from the training datasets: Random Forests, k-Nearest Neighbors, and the like.

Besides, I realized that these models did not all mine the same wells of information in the training data. Some would ignore the correlation structure among features, others would only pay attention to very narrow neighborhoods of new cases, etc.

This is why my final model was built as a *convex combination* – the convex combination that would minimize global error – of the three best models I had obtained so far; it was superior to any individual model. Error convexity means that inferior models can still help improve better ones!
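
To see why a convex combination can beat both of its components when their errors partially cancel, consider this small sketch (invented numbers; a grid search over the mixing weight):

```python
def best_convex_pair(p1, p2, y, grid=101):
    """Search lam in [0, 1] minimizing the MSE of lam*p1 + (1-lam)*p2."""
    best = (float("inf"), 0.0)
    for g in range(grid):
        lam = g / (grid - 1)
        mse = sum((lam * a + (1 - lam) * b - t) ** 2
                  for a, b, t in zip(p1, p2, y)) / len(y)
        best = min(best, (mse, lam))
    return best[1], best[0]

y = [1.0, 2.0, 3.0]
p1 = [2.0, 3.0, 4.0]   # always 1 too high (MSE = 1 on its own)
p2 = [0.0, 1.0, 2.0]   # always 1 too low (MSE = 1 on its own)
lam, mse = best_convex_pair(p1, p2, y)
```

Here each model alone has MSE 1, yet the 50/50 blend is exact, because the two error patterns are opposite; with three real models the weights are of course less symmetric, but the principle is the same.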

* * *

**By Benjamin Hamner (hamner), 3rd place. Benjamin recently completed his bachelor's degree at Duke University and is currently doing research at EPFL in Lausanne, Switzerland, on a Whitaker Fellowship.**

Preprocessing: Three training data sets were considered:

*Set A* – 1000 data points, corresponding to the first half-hour of each hour-long window in the training set.

*Set B* – 11,000 data points, Set A + shifts in the training window in one-minute increments for 10 minutes.

*Set C* – 55,000 data points, Set B + all additional possible shifts.

Set A had the advantage that it did not contain redundant data and that it was drawn from the same distribution as the test set. Sets B and C augmented the data with additional points, but at the cost of potentially shifting the distribution of the training set away from that of the test set.
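
The three sets can be enumerated with a small sketch, assuming (as described) 600-minute simulation cycles and 50-minute sequences of 30 training + 10 gap + 10 target minutes; per cycle this yields 10, 110 and 551 windows, roughly matching the 1000 / 11,000 / 55,000 counts above over 100 cycles.

```python
def window_starts(cycle_minutes=600, span=50, shifts=range(0, 1)):
    """Start minutes of `span`-minute sequences per simulation cycle:
    hour-aligned starts plus the given extra one-minute shifts."""
    starts = set()
    for hour_start in range(0, cycle_minutes, 60):
        for s in shifts:
            t = hour_start + s
            if t + span <= cycle_minutes:   # sequence must fit in the cycle
                starts.add(t)
    return sorted(starts)

set_a = window_starts(shifts=range(0, 1))    # hour-aligned only
set_b = window_starts(shifts=range(0, 11))   # + shifts of 1..10 minutes
set_c = window_starts(shifts=range(0, 60))   # every possible 1-minute shift
```

Multiplying the per-cycle counts by 100 training cycles reproduces the sizes of Sets A and B exactly and Set C approximately.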

The mean RMSE of each of these sets on the training data set was evaluated (split 50% training, 50% validation, 100-tree random forests, 15-minute downsampling window). The results were:

*Set A* – 24.62

*Set B* – 22.64

*Set C* – 22.59

The ATR time series was downsampled by summing the counts of cars over the road segments in consecutive intervals. The graph below shows how the width of the window affected performance within the training set (split 50% training, 50% validation, 12-tree Random Forests on Set C).
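
Downsampling by summation is a one-liner; here is a sketch with an illustrative 15-minute window over a 30-minute series:

```python
def downsample(counts, width):
    """Sum per-minute counts into consecutive non-overlapping `width`-minute bins."""
    return [sum(counts[i:i + width])
            for i in range(0, len(counts) - width + 1, width)]

# invented per-minute car counts for a 30-minute window
bins = downsample(list(range(1, 31)), width=15)
```

Wider windows trade temporal resolution for less noisy counts, which is exactly the trade-off the graph explored.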

Regression: 100-tree random forests were used for regression on the training data sets.

Ensembles of Random Forests: The most successful ensemble was:

20% – Set B, 15-minute downsampling window (Validation score: 24.829)

40% – Set C, 10-minute downsampling window (Validation score: 24.637)

40% – Set C, 15-minute downsampling window (Validation score: 24.595)

Ensemble – Validation score 24.411, Test score 25.337

* * *

**By Vladimir Nikulin (team UniQ), 4th place. Vladimir works at the Department of Mathematics, University of Queensland, Australia.**

With the following approach we observed a preliminary (leaderboard) result of 25.8. As the target variable we took the sum of the cars for the period between the 41st and 50th minutes. We used 600 features: the numbers of cars between the 1st and 30th minutes for the 20 given segments. We then produced two solutions using the randomForest and GBM packages in R, which were linked together by an ensemble constructor as described in V. Nikulin, “Web-mining with Wilcoxon-based feature selection, ensembling and multiple binary classifiers” (see the web-site of the PKDD2010 Discovery Challenge). A further improvement to 24.502 was achieved by introducing 3 regulation parameters: p1 = 6, used for smoothing of the features (moving averages); p2 = 8, the time interval (in minutes) between the known and predicted data; and p3 = 12, the smoothing parameter for the predicted time interval. Accordingly, the number of features was reduced to 500, and as a target we used the sum of cars for the period between the 39th and 50th minutes. 10-fold cross-validation was used to optimise the values of all regulation parameters.
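
The p1 smoothing (moving averages over the per-minute feature series) can be sketched as follows. This is an illustrative Python version; Vladimir worked in R, and the handling of the first few minutes (a shorter trailing window) is our assumption.

```python
def moving_average(x, p):
    """Trailing moving average of window length p; the window is
    truncated at the start of the series."""
    out = []
    for i in range(len(x)):
        lo = max(0, i - p + 1)
        out.append(sum(x[lo:i + 1]) / (i - lo + 1))
    return out

# invented per-minute car counts for one segment, smoothed with p = 2
smoothed = moving_average([1, 2, 3, 4], p=2)
```

Applying this per segment with p1 = 6 smooths out minute-to-minute noise before the 600 raw features are reduced to 500.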

Remark: the parameter p2 represents a novel compromise between the training and test data: by making p2 smaller we improve the quality of prediction (moving the target closer to the training data); on the other hand, we cannot move too far from the required test time interval. By making p3 bigger than 10 we simplify the task; however, p3 cannot be too big either, or the task will suffer from over-smoothing. It is another compromise.

* * *

**We thank all the authors for their descriptions and congratulate them on achieving top scores!**


Task 3 – GPS

Transforming GPS Coordinates to Road Segments: An algorithm was developed to rapidly determine which road segment a car was on, given its GPS coordinates. The roads in Warsaw were preprocessed by laying a grid of points over the map and determining which road segments lay within a certain radius of each grid point. This meant that, given a GPS coordinate, only the distances to road segments near the four grid points closest to that coordinate needed to be evaluated. The preprocessing was then repeated with finer grids of points.
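
The grid idea can be sketched as follows. This is a simplified Python illustration that buckets segments (represented here as single midpoints) into square cells and searches only the cells around the query; the real implementation handled full segments, radii, and iterative grid refinement.

```python
import math

def build_grid(points, cell):
    """Bucket point-like segments into square cells of side `cell`."""
    grid = {}
    for idx, (x, y) in enumerate(points):
        grid.setdefault((int(x // cell), int(y // cell)), []).append(idx)
    return grid

def nearest_nearby(grid, points, cell, qx, qy, ring=1):
    """Check only segments bucketed near the query cell
    (assumes at least one segment lies within the searched ring)."""
    cx, cy = int(qx // cell), int(qy // cell)
    best, best_d = None, float("inf")
    for dx in range(-ring, ring + 1):
        for dy in range(-ring, ring + 1):
            for idx in grid.get((cx + dx, cy + dy), []):
                d = math.hypot(points[idx][0] - qx, points[idx][1] - qy)
                if d < best_d:
                    best, best_d = idx, d
    return best

segments = [(0.5, 0.5), (5.5, 5.5), (0.6, 0.7)]  # invented segment midpoints
grid = build_grid(segments, cell=1.0)
hit = nearest_nearby(grid, segments, 1.0, 0.6, 0.65)
```

Each query touches a handful of cells instead of all segments, which is what turned a multi-year brute-force job into a two-day one for 200 million readings.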

Though there are surely faster algorithms, this worked – it allowed the 200 million GPS readings in the competition data set to be transformed to road segments in under two days, instead of several years for the brute force approach. No corrections were made for GPS coordinates outside of Warsaw, which likely had a slight negative impact on the results.

Preprocessing: Three sets of features were extracted from the raw and transformed data, representing the local and aggregate traffic flow. The first set represented aggregate traffic flow: a grid of 16 points was put on the map of Warsaw, and the number of cars closest to each point in two categories (stopped and moving) was counted for each half-hour time window, providing 32 features. The second set of features also represented the aggregate traffic flow and utilized the transformed data. Three high-dimensional matrices were formed: one with the counts of moving cars on each road segment in each time window, one with the counts of stopped cars, and one with the mean speed of the moving cars. A hacked-together version of Principal Component Analysis (PCA) that could run quickly on such high-dimensional matrices on a laptop was applied, and the first 12 principal components were taken for each matrix. This provided an additional 36 features.

The set of features for local traffic flow varied based on the edge being predicted. These features included the counts of moving and stopped cars along with the mean speed of moving cars on both the edge being predicted and the edges connected to it. This accounted for an additional 6-42 features, depending on the edge.

Regression: Two 100-tree random forests were trained for each of the 100 edges being forecast, one for the 31st-36th minute predictions, and one for the 54th-60th minute predictions. While some of the actual velocities had bimodal or trimodal distributions, the predictions were almost always unimodal (see the graph below).

To account for this, random forests were first trained to classify the data into different contexts if the velocity distributions were bimodal or trimodal, and then new random forests performed regression within each context. For example, on the edge in the above graph, a random forest was first trained to split the data into three groups: (1) likely having a speed < 20 km/hr, (2) likely having a speed from 20-60 km/hr, and (3) likely having a speed above 60 km/hr.
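
The classify-then-regress scheme can be illustrated with a toy sketch, where per-context means stand in for the per-context random forests and a simple threshold stands in for the context classifier:

```python
class ContextRegressor:
    """Toy classify-then-regress: a context classifier routes each sample
    to its own regressor (here a per-context mean)."""
    def fit(self, X, y, classify):
        self.classify = classify
        groups = {}
        for x, t in zip(X, y):
            groups.setdefault(classify(x), []).append(t)
        # one trivial "regressor" (the mean) per context
        self.means = {c: sum(v) / len(v) for c, v in groups.items()}
        return self
    def predict(self, X):
        return [self.means[self.classify(x)] for x in X]

# invented bimodal speeds: jammed (~11 km/h) vs free-flowing (~72 km/h)
model = ContextRegressor().fit([0.0, 0.1, 0.9, 1.0],
                               [10.0, 12.0, 70.0, 74.0],
                               classify=lambda x: x > 0.5)
speeds = model.predict([0.05, 0.95])
```

A single global mean would predict ~41.5 km/h for everything; routing through contexts keeps each prediction inside the correct mode, which is the effect that improved the results above.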

Making the regression context-dependent substantially improved the results. The graph below shows how parameterizations over subsets of the possible features affected the performance on the training portion of the dataset (split 50% train, 50% validation). For the aggregate counts, the parameter values 1-5 correspond to 4, 16, 64, 144, and 256 regions used to predict traffic flow. For the aggregate PCA model, the parameter values 1-6 correspond to 2, 4, 8, 12, 16, and 32 principal components used per matrix. Different parameters were not evaluated for models shown between parameter values 3 and 4.

Computation: Everything was done in Matlab on a 2.53 GHz MacBook Pro with 4 GB of RAM.

Other Thoughts: This was an entirely unprincipled approach to the problem, beyond “if it looks like it works, go with it.” This worked very well for the competition, and left a lot of room for future improvement. Tracking the movement of individual cars through time, extracting additional features, applying feature selection, and testing alternative regression and modeling techniques may improve results on the existing data set. Real-world data sets could have additional information that could enhance the predictive power of algorithms. These include data on the date, time, and weather, as well as traffic information from other data sources.

Question to Alex Groznetsky.

I like your idea of using SVD very much – I was really surprised that this method can be applied here. But it's still unclear to me what's contained in the matrix C that you decompose, and how to interpret the L and R matrices. So, to be precise, my questions are:

1. How did you come up with the “2000” and “120” dimensions for C?

2. What’s the meaning of L and R? For example, in the Netflix Prize these matrices represented features of movies and preferences of users. How can they be interpreted in the case of traffic prediction? Thanks.

Hello Marcin,

1. Each row of C corresponds to one hour of the simulation. There were 1000 rows with data from the Train set and 1000 from the Probe set. Summary traffic values in consecutive 10-minute intervals for each road segment were used to fill the rows (20 segments × 6 values/hour = 120 values/row). The following pseudocode may clear up the remaining questions:

```
for HourIndex = 0 to 1000-1
    for IntervalIndex = 0 to 6-1
        for SegmentIndex = 0 to 20-1
            C[HourIndex][IntervalIndex*20 + SegmentIndex] =
                TrainData[(HourIndex*60 + IntervalIndex*10)*20 + SegmentIndex];

// Probe rows: only the first 30 minutes of each hour are known, so the
// Probe set is indexed from 0 via (HourIndex-1000); the rest is missing.
for HourIndex = 1000 to 2000-1
    for IntervalIndex = 0 to 3-1
        for SegmentIndex = 0 to 20-1
            C[HourIndex][IntervalIndex*20 + SegmentIndex] =
                ProbeData[((HourIndex-1000)*30 + IntervalIndex*10)*20 + SegmentIndex];
    for IntervalIndex = 3 to 6-1
        for SegmentIndex = 0 to 20-1
            C[HourIndex][IntervalIndex*20 + SegmentIndex] = MissedValue;
```

2. Netflix analogy: L represents features of the simulation hours across all segments, R – features of the simulation cycles for each of the segments. SSA analogy: L – short-term (1 hour) traffic correlations between segments, R – long-term traffic correlations between the simulation cycles.

The starting idea was to learn, in R, the low-frequency spectral characteristics of the road graph outside the 10 selected segments. This did not work well.


Question to Alexander Groznetsky :

Hi, Alexander:

I have a problem regarding the model of this machine learning problem. In your solution, you state that the target value (regressand) is the sum of congestion at minutes 41-50 at a road segment. In my opinion, this sounds reasonable for the test data. But with respect to the training data, one cycle has data collected over 600 continuous minutes, that is, one cycle contains 600 lines. Then how do you determine which group of lines to use to generate a target value? Can I simply use the data from line 41 to line 50 to generate one target value, use the data from line 101 (60+41) to line 110 (60+50) to generate another target value, and so on? If so, does this method sound reasonable? In short, my question is how to transform the raw training data into training data in the standard form, that is, each line containing a target value followed by a feature vector.

Since I’m a machine learning novice, my question might sound stupid to you, but I’m dying to learn about it. Thank you for your attention and help.

Hello Zheng Jiangchuan,

> Can I simply use the data from line 41 to line 50 to generate one target value, and use the data from line 101 (60+41) to line 110 (60+50) to generate another target value, and so on?

This is exactly what I meant. 20 target values (one per segment) from lines 41-50, the next 20 from lines 101-110, and so on.

> If so, does this method sound reasonable?

For me yes. Why not for you?

By the way, you can shift the point sequences (30 minutes train, 10 minutes pause, 10 minutes target) inside the 10-hour simulation cycles in 1-minute steps. But I recommend you start from the sequences aligned on the simulation hour boundaries (i.e., as defined above).

Regards,

Alexander.

Hi, Alexander, thanks very much for your kind reply.

I’m also interested in your SVD approach, but since I’m not familiar with the SVD-like approaches used for collaborative filtering in the Netflix Prize, it might be difficult for me to understand the key idea here. I know it would be time-consuming for you to explain the technical details here, so would you mind telling me from which paper or other sources I could learn the details of the SVD approach used in the Netflix Prize, similar to the one you used here? I’m eager to learn about it.

Thanks again and best regards

Zheng Jiangchuan

Hello Zheng Jiangchuan,

I can recommend that you read this for a start. I can’t point you to a direct analogue of my approach, sorry. As far as I can remember, the most similar approaches were described in “A. Paterek. Improving regularized singular value decomposition for collaborative filtering”. Search Google for the quoted string. There were a lot of variations in the Netflix challenge – look at the “Cited by …” papers in the Google search results. Another source of related papers is the Netflix Prize forum.

Regards,

Alexander.

Thanks very much for your kind and patient help. Your suggestions help me a lot.

Best

Jiangchuan

Dear Vladimir Nikulin:

I have a question: in this specific application, how do you transform the raw training data into training data in the standard form (that is, each line corresponds to an instance, which contains a target value and a feature vector)? In the raw training data, one cycle contains 600 lines (corresponding to 600 consecutive minutes). Did you do this in the following way:

“Use lines 1 to 30 to generate the feature vector a1, and lines 41 to 50 to generate the target value y1, and regard (y1, a1) as instance 1; then use lines 61 to 90 to generate the feature vector a2, and lines 101 to 110 to generate the target y2, and regard (y2, a2) as instance 2; and so on”

In short, my major confusion is how to appropriately partition the lines of data in the training data file, that is, which group of lines should be used to generate the input variables, and which group should be used to generate the output variables (target values).

Thank you for your attention and help.

Hi Zheng Jiangchuan,

please note that we are dealing here not with vectors but with matrices. There are 30 rows and (!) 20 columns, 600 elements in total, which we can re-arrange as a single row (the vector of explanatory variables). The following matrix, from the 41st to the 50th rows, is summed into a vector of 20 elements, which are treated as the target variables (of course, not all together, but one after another – predictions of the traffic at the single intersections). There are 1000 hours for training, and any hour will produce 10 lines as described above. So the training matrix will be 10K by 600.

Best regards, Vladimir
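
Vladimir's construction can be sketched in Python. This is a hypothetical helper, not his R code: it takes per-minute rows of segment counts and emits (feature vector, targets) instances, 10 shifted instances per hour, matching the 10K × 600 training matrix he describes.

```python
def build_instances(cycle, segments=20):
    """cycle: per-minute rows, each a list of `segments` counts (600 rows/cycle).
    Each hour yields 10 instances via 1-minute shifts: minutes s..s+29 flattened
    into the feature vector, and the sums of minutes s+40..s+49 per segment
    as the 20 targets."""
    instances = []
    for hour in range(0, len(cycle) - 59, 60):
        for s in range(10):
            start = hour + s
            feats = [v for row in cycle[start:start + 30] for v in row]
            targets = [sum(cycle[m][j] for m in range(start + 40, start + 50))
                       for j in range(segments)]
            instances.append((feats, targets))
    return instances

# toy cycle: 600 minutes, 2 segments, count at minute m is simply m
cycle = [[m, m] for m in range(600)]
instances = build_instances(cycle, segments=2)
```

With 20 segments and 100 training cycles of 600 minutes each, the same loop produces 10,000 rows of 600 features apiece, exactly the matrix dimensions in the reply above.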