IEEE ICDM Contest – Overview of Top Solutions, part 2

As previously announced, we publish further descriptions of top solutions of the IEEE ICDM Contest: TomTom Traffic Prediction for Intelligent GPS Navigation. This time you may read reports on Task 2 “Jams” and Task 3 “GPS”. Solutions of Task 1 can be found in Part 1.

If you want to ask the authors any questions, feel free to comment below the post.

Task 2 “Jams”

By Łukasz Romaszko (lukasz21212121),
the winner, from the University of Warsaw, Poland.

The algorithm focuses on computing a rank for each street (a higher rank means a greater probability of a traffic jam) and the number of streets expected to be jammed. The highest-ranked streets are given as the output. In particular, it uses an adaptation of the k-Nearest Neighbors algorithm. The following description of the algorithm is simplified. A scheme presenting the idea of the algorithm is given below.

Computing the street ranks and the number p of jammed streets consists of several steps. Two ordered sets are chosen from the training data set: Similarity and LengthSimilarity. Similarity is used to compute street ranks, while LengthSimilarity is used to determine the number p. On the basis of the Similarity set and special functions, the algorithm generates an array RankingList. In the next step, RankingList is slightly modified by taking into consideration the locations of streets. The top p streets are given as the output.
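The final step above (output the top p streets from RankingList) can be sketched in a few lines of Python. This is only an illustration: the function name top_p_streets, the dictionary representation of RankingList and the tie-breaking by street id are assumptions made for the example, not details taken from the author's solution.

```python
from typing import Dict, List


def top_p_streets(ranking: Dict[int, float], p: int) -> List[int]:
    """Return the ids of the p highest-ranked streets.

    `ranking` maps a street id to its rank (hypothetical representation of
    RankingList). Ties are broken by the smaller street id, which is an
    arbitrary choice made for this sketch.
    """
    ordered = sorted(ranking.items(), key=lambda item: (-item[1], item[0]))
    return [street for street, _ in ordered[:p]]


# Made-up ranks for four streets; p would come from LengthSimilarity.
ranking_list = {101: 0.9, 102: 0.2, 103: 0.7, 104: 0.7}
predicted = top_p_streets(ranking_list, p=3)
print(predicted)
```

In the described algorithm both inputs come from earlier steps: the ranks from the Similarity set and p from the LengthSimilarity set.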


Let us denote by T the test data: the set of identifiers of the 5 excluded segments followed by a sequence of major roads where the first jams occurred during the initial 20 minutes of the simulation.

Generating Similarity and LengthSimilarity ordered sets

At the beginning, the algorithm compares sequences from the training data to T. Two different measures were used to compute Similarity and LengthSimilarity. Sequences are compared based on the index positions of roads that were jammed in both sequences. Let Δ be the difference between the length of the current sequence D and the length of T. |T| denotes the number of jammed roads in T. The measure used to generate Similarity assumes that sequences of similar length (small Δ) are more reliable. It is worth emphasizing that the metric used to generate the Similarity set has the greatest influence on the result.
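To make the idea of such a measure more concrete, here is a minimal Python sketch. It is not the author's formula, which is not given in this excerpt: the scoring rule (reward roads jammed in both sequences more when their index positions are close, then divide by 1 + |Δ|) and the names similarity and first_positions are assumptions chosen only to illustrate the two ingredients mentioned above. For simplicity, T is represented here by its sequence of jammed roads only.

```python
from typing import Dict, Sequence


def first_positions(seq: Sequence[int]) -> Dict[int, int]:
    """Map each road id to the index of its first occurrence in seq."""
    positions: Dict[int, int] = {}
    for index, road in enumerate(seq):
        positions.setdefault(road, index)
    return positions


def similarity(d: Sequence[int], t: Sequence[int]) -> float:
    """Illustrative similarity between a training sequence d and test sequence t.

    Assumed rule (not from the original solution): every road jammed in both
    sequences contributes 1 / (1 + position difference); the sum is then
    divided by 1 + |delta|, so sequences of similar length score higher.
    """
    pos_d = first_positions(d)
    pos_t = first_positions(t)
    common = pos_d.keys() & pos_t.keys()
    position_score = sum(1.0 / (1 + abs(pos_d[r] - pos_t[r])) for r in common)
    delta = abs(len(d) - len(t))
    return position_score / (1 + delta)


# Made-up road ids: order the training sequences by similarity to T.
t_jams = [1, 2, 3]
training = [[3, 2, 1], [1, 2, 3, 4, 5], [1, 2, 3], [7, 8]]
similarity_set = sorted(training, key=lambda d: similarity(d, t_jams), reverse=True)
print(similarity_set)
```

Sorting the training sequences by this score gives an ordered set in the spirit of Similarity, from which nearest neighbours could be taken.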

Sequence evaluation in LengthSimilarity was similar to that in Similarity, but took into consideration only sequences with Δ ≤ 10. The algorithm for computing values in Similarity is described in the full post.

IEEE ICDM Contest – Overview of Top Solutions, part 1

The IEEE ICDM Contest: TomTom Traffic Prediction for Intelligent GPS Navigation has come to an end. As promised, we publish descriptions of the top solutions, provided by the participants. Although the reports had to be brief, the authors not only revealed a good deal of important details about their approaches, but also kept the descriptions straightforward and concise, giving all of us an unprecedented opportunity to learn the essence of data mining know-how. This is a good supplement to the fully scientific articles that will be presented during the Contest Workshop at the ICDM conference in Sydney.

Today, we publish descriptions for Task 1, “Traffic”. In the coming days we’ll make another post with the Task 2 and 3 reports – stay tuned! We thank all the authors for their contributions.

If you want to ask the authors any questions, feel free to comment below the post.

* * *

By Alexander Groznetsky (alegro), the winner. Alex is an experienced data miner who participated (under the nick orgela) in the Netflix Prize contest in its early days – this becomes pretty clear when you look at the list of algorithms he used for ICDM – they sound very Netflix-like :). To learn about the task, see the “Traffic” task description page.


There are no alternatives to data mining

During the 7th International Conference on Rough Sets and Current Trends in Computing (RSCTC 2010), we had the pleasure of talking with Vladimir Nikulin from the University of Queensland, Australia, the first-place winner of the Basic Track in the RSCTC’2010 Discovery Challenge: Mining DNA microarray data for medical diagnosis and treatment. Below you may read the transcript of our chat and learn what tips and tricks can be used to hack the data and achieve victory in a data mining competition. Obligatory reading for all practitioners and contestants.


Winners of RSCTC 2010 Discovery Challenge. From left: Guoyin Wang, Huan Luo, ChuanJiang Luo (Advanced Track), Vladimir Nikulin (Basic Track), Marcin Wojnarski (Chair)

Magdalena Pancewicz, TunedIT: How did it happen that data mining became your field of interest?

Vladimir Nikulin: I’ve got a PhD in Mathematical Statistics from Moscow State University. I worked in Russia as a scientific programmer and then as an Associate Professor in Mathematics & Statistics. When I came to Australia in 1993, I had a very solid academic background, and it wasn’t difficult for me to study the most advanced methods of computational statistics and data mining (DM). I made the most progress in this area after joining the Australian National University in Canberra, where I worked for over 2 years, publishing papers and attending several international conferences. Also, I participated in data mining competitions, and won my first challenge in 2007: Agnostic Learning vs. Prior Knowledge Challenge.

