# IEEE ICDM Contest – Overview of Top Solutions, part 2

November 5, 2010 1 Comment

**As previously announced, we publish further descriptions of top solutions of ****the IEEE ICDM Contest: TomTom Traffic Prediction for Intelligent GPS Navigation.**

**This time you may read**

**reports**

**on**

**Task 2 “Jams” and Task 3 “GPS”. Solutions of Task 1 can be found in Part 1.**

**If you want to ask the authors any questions feel free to comment below the post.**

### Task 2 “Jams”

By Łukasz Romaszko (*lukasz21212121*),** ****the winner, from the ****University of Warsaw****, Poland.**** **

The algorithm focuses on computing ranks for each street (the higher rank means greater probability of traffic jam) and the number of streets to be jammed. The highest ranked streets are given as the output. In particular it uses an adaptation of the k-Nearest Neighbors algorithm. The following description of the algorithm is simplified. The scheme presenting the algorithm idea is given below.

Computing street ranks and the number* p *of jammed streets consists of several steps. From the training data set there are chosen two ordered sets: *Similarity* and *LengthSimilarity*.* Similarity* is used to compute street ranks, while *LengthSimilarity* to determinate number *p*. On the basis of *Similarity* set and special functions algorithm generates an array RankingList. In the next step *RankingList* will be slightly modified by taking into consideration the locations of streets. The top *p* streets are given as the output.

*Algorithm*

Let us denote by *T* the test data: the set of identifiers of the 5 excluded segments followed by a sequence of major roads where the first jams occurred during initial 20 minutes of the simulation.

*Generating Similarity and LengthSimilarity ordered sets*

At the beginning the algorithm compares sequences from Training data to* T*. Two different measures were used to compute *Similarity *and* LengthSimilarity*. Sequences are compared based on the indices positions of roads which were jammed in both sequences. Let Δ be the difference between length of current sequence D and length of T. |T| denotes the number of jammed roads in T. The measure used to generate *Similarity* assumes that sequences of similar Δ length are more reliable. It is worth emphasizing that the metric used to generate *Similarity *set has the greatest influence on the result.

Sequences evaluation in *LengthSimilarity* was similar to that in* Similarity*, but took into consideration only sequences of Δ <=10. The algorithm for computing values in *Similarity* is described below:

M := f * |T|; { the best result was achieved with f = 1.10 } FOR each training data sequence D DO mS := 0; { used to evaluate D in Similarity } FOR each identifier Di at position i in D DO IF exists Tj in T that Tj = Di THEN mS := mS + max(0, M - |i-j|); Similarity(D) := Weight(Δ)*(mS/|D|); { Sort Similarity and LengthSimilarity decreasingly. }

*Analyzing Similarity and LengthSimilarity sets*

* Algorithm 1*: Roads Evaluation.

Streets which are jammed the most frequently in *Similarity* are thought to jam in the test case with the highest probability. The algorithm evaluates each street assuming that most similar sequences to *T *are the most reliable: Moreover, streets jammed at initial positions have greater influence on the street rank. The algorithm counts how often did streets occurred in the top N best sequences in *Similarity*.

*Algorithm 2*: Length Prediction of the output sequence.

The value of p depends on average sequence length in *LengthSimilarity*. The evaluation metric MAP appreciates listing of expected jammed streets even if they are listed after the expected total number of roads, therefore the output sequence should be longer than *averageLength* (factor 1.5).

*Improvements and the output*

For each street s if that street was evaluated to jam too frequently or too seldom than it should (in average based on local evaluation on tests created from part of Training data set), the *RankingList*[*s*] was multiplied by the factor < 1.0 or > 1.0, respectively. Later *RankingList* was sorted decreasingly. Subsequently, for each street in * Result*, we count the minimum distance to any other edge in *Result: a*. Connected edges have greater probability of getting jammed therefore they should be listed earlier. There are a few iterations, during which the roads order is changed slightly: streets identifiers in the output are swapped if any street has higher* a *value then adjacent. Eventually, the first *p* streets were given as the output.

*Results*

The Training data set was split into two parts: Training part (which consisted of 80% of whole data) and a validation part (20%). Parameters mentioned above were tuned independently by repeating the experiment and evaluating by the validation part. It was a very time consuming process. The code was written on Windows XP operating system in Java. Each execution time for 5000 cases on a processor 2.8Ghz was six minutes (about 0.07 s per one test case). This algorithm achieved the best result in the contest: 0.5598 points in the preliminary evaluation and 0.56056 in the final.

### Task 3 “GPS”

** ****By Benjamin Hamner ( hamner),**

**the winner.**

**Benjamin recently completed his bachelors degree at Duke University and is currently doing research at the EPFL, Lausanne, Switzerland, on a Whitaker Fellowship.**

*Transforming GPS Coordinates to Road Segments*

An algorithm was developed to rapidly determine which road segment a car was on given its GPS coordinate. The roads in Warsaw were preprocessed by laying a grid of points over the map and determining which road segments lay within a certain radius. This meant that, given a GPS coordinate, only the distance to road segments near the four closest grid points to the GPS coordinate needed evaluation. The preprocessing was reiterated with finer grids of points.

Though there are surely faster algorithms, this worked – it allowed the 200 million GPS readings in the competition data set to be transformed to road segments in under two days, instead of several years for the brute force approach. No corrections were made for GPS coordinates outside of Warsaw, which likely had a slight negative impact on the results.

*Preprocessing*

Three sets of features were extracted from the raw and transformed data, representing the local and aggregate traffic flow. The first set represented aggregate traffic flow: a grid of 16 points was put on the map of Warsaw. The number of cars closest to each point in two categories (stopped and moving) were counted for each half-hour time window, providing 32 features. The second set of features represented the aggregate traffic flow as well and utilized the transformed data. Three high-dimensional matrices were formed, one with the counts of moving cars on each road segment in each time window, one with the counts of stopped cars, and with the mean speed of the moving cars. A hacked-together version of Principal Component Analysis (PCA) that could run quickly on such high-dimensional matrices on a laptop was applied, and the first 12 principal components were taken for each matrix. This provided an additional 36 features.

The set of features for local traffic flow varied based on the edge being predicted. These features included the counts of moving and stopped cars along with the mean speed of moving cars on both the edge being predicted and the edges connected to it. This accounted for an additional 6-42 features, depending on the edge.

*Regression*

Two 100-tree random forests were trained for each of the 100 edges being forecast, one for the 31st-36th minute predictions, and one for the 54th-60th minute predictions. While some of the actual velocities had bimodal or trimodal distributions, the predictions were almost always unimodal (see the graph below).

To account for this, random forests were first trained to classify the data into different contexts if the velocity distributions were bimodal or trimodal, and then new random forests performed regression within each context. For example, on the edge in the above graph, a random forest was first trained to split the data into three groups: (1) likely having a speed < 20 km/hr, (2) likely having a speed from 20-60 km/hr, and (3) likely having a speed above 60 km/hr.

Making the regression context-dependent substantially improved the results. The graph below shows how parameterizations over subsets of the possible features affected the performance on the training portion of the dataset (split 50% train, 50% validation). For the aggregate counts, the parameter values 1-5 correspond to 4, 16, 64, 144, and 256 regions used to predict traffic flow. For the aggregate PCA model, the parameter values 1-6 correspond to 2, 4, 8, 12, 16, and 32 principal components used per matrix. Different parameters were not evaluated for models shown between parameter values 3 and 4.

*Computation*

Everything was done in Matlab on a 2.53 GHz MacBook Pro with 4 GB of RAM.

*Other Thoughts*

This was an entirely unprincipled approach to the problem, beyond “if it looks like it works, go with it.” This worked very well for the competition, and left a lot of room for future improvement. Tracking the movement of individual cars through time, extracting additional features, applying feature selection, and testing alternative regression and modeling techniques may improve results on the existing data set. Real-world data sets could have additional information that could enhance the predictive power of algorithms. These include data on the date, time, and weather, as well as traffic information from other data sources.

*** * ***

**By Andrzej Janusz**** ( NosferatoCorp)**

**, 3rd place. Andrzej is a PhD student at the University of Warsaw, Poland.**

In this task, data consisted of a large number of GPS logs from cars driving through the streets of Warsaw acquired in a number of simulations. Road map of Warsaw was represented as a street graph. The objective of the task was to predict average velocities of cars passing through 100 selected edges.

The dataset was preprocessed prior to further analysis. For each of 50 training and 500 test simulations, the cars which passed through selected edges were extracted. The training simulations were divided into disjoint 6 minute time windows, corresponding to the available information about the true average velocity values at the selected edges, after 6 and 30 minutes time stamps. The time windows were described by 3×100 attributes expressing average velocities of cars passing through each of 100 selected edges, edges that end at the starting nodes of the selected edges and those which begin at the ending nodes of the edges from the task. Values of those attributes were computed using the filtered GPS logs.

A decision table was constructed in order to create a predictive model. For each simulation every 5 consecutive time windows were merged. Those descriptions of 30 minute periods were treated as objects. The training data contained 4550 such instances (each described by 5×300 attributes and having 200 target decision values) and the test set consisted of 500 instances. All the missing values were linearly interpolated.

The k-Nearest Neighbors regression model was used for making predictions. The standard algorithm was modified by introducing additional sets of weights of objects assigned to each of the target decision attributes. The weights expressed the average impact of particular objects at the squared error of predictions. They were computed during the initial cross-validation on the training set (a leave-one-out technique was used to minimize the temporal bias).

After the computation of the weights, the second cross-validation has been performed in which irrelevant attributes were filtered out (a standard correlation-based filter was used) and the *k* value was tuned (separately for each of the target decisions). Finally, the predictions were made. For every test instance, *k* nearest neighbors were selected (in Manhattan metric) and a weighted average of their decision values was taken. The predictions were additionally weighted based on distances of the neighbors from the tested objects. A standard triweight kernel was utilized. The model was implemented in R System using the “kknn” package.

*** * ***

**By Amr Kabardy and Nayer Wanas**** (team amrkabardy), 5th place. **

**Amr and Nayer work in**

**the Microsoft Innovation Lab, Cairo, Egypt.**

Traffic is repetitive in nature, both spatially and temporally, and in response to certain patterns and events. Given the data provided, and this observation, the approach aims to use the leading 30-mins, in a given segment, to identify the most similar 30-mins pattern of velocities in a segment of the same profile. The profile of a segment is based on the number of lanes and the estimated throughput. The cosine similarity of the estimated velocities is used as a similarity measure. Moreover, it is weighted based on the temporal and spatial separation between the segments. Estimation of the velocity includes smoothing the instantaneous velocities reported, identifying the most representative readings, removing outliers and averaging to produce an overall velocity of the segment at a given time. In the case of the lack of suitable data, the ratio of the estimated velocity to the maximum velocities of inbound and outbound segments is used to estimate the velocity. It is worth mentioning that smoothing the reported velocities through averaging on the 50 cycles produces better approximation, which in turn supports the original assumption.

*** * ***

**We thank all the authors for their descriptions and wish good luck in next TunedIT competitions!**

*M *:= *f ** * |T|*; (the best result was achieved with *f *= 1.10)

FOR each training data sequence *D *DO

*m**S *:= 0 (used to evaluate *D *in *Similarity*);

FOR each identifier *D**i *at position *i *in *D *DO

IF exists *T**j *in *T *that *T**j *= *D**i *THEN

*m**S *:= *m**S *+ *max*(0*, M – |i-j|*);* *

*Similarity*(*D*) := *W*eight(Δ)**(**m**S**/|D|)*;

Sort *Similarity *and *LengthSimilarity *decreasingly.

Pingback: IEEE ICDM Contest – Overview of Top Solutions, part 1 « TunedIT Data Mining Blog