Ed Ramsden on his winning solution in SIAM SDM’11 Contest

By Ed Ramsden (EdR), the winner of SIAM SDM’11 Data Mining Contest.

The basis for my model was the Probabilistic Neural Network (PNN), which was originally introduced by Donald Specht in 1990. The PNN is an example-based classifier in which the ‘X’ vector for an unknown case to be classified is compared to all known-class cases used in the training set.  A distance metric (typically Euclidean) is passed through a gaussian function to estimate a ‘probability’ of a match with each training case.  These individual probabilities are combined for each class in the training set, and the class with the highest composite probability (W) is selected as the most likely class for the unknown case.  The evaluation and combination function used was:

Although a PNN can be used with little or no training, this problem posed several difficulties. The first was the high dimensionality of the input data. Because they are example based, PNN classifiers require their input ‘space’ to be reasonably well filled in order to perform. As the number of input features is increased, one would expect their input space to become exponentially sparser. The solution to this was to employ feature selection.  Also, another challenge for obtaining good performance is the proper selection of σ,  which controls the selectivity of the classifier. If one makes  σ too large, the classifier will tend lose the ability to differentiate between different input data. On the other hand, if  σ is made too small, the classifier loses the ability to generalize beyond its training set.  The problems of both feature and  σ selection were solved by using a guided random walk, with the objective of maximizing the Modified Youdon performance on the training set. One feature of this approach is that it does not require the calculation of gradient information, only the value of the metric being maximized.  To avoid severe overtraining effects, a leave-one-out scheme was used to evaluate training-set performance.

Because the PNN model developed as described above only sees a small subset of available inputs, I decided to attempt to increase the performance through constructing ensembles of the PNNs, and then taking a simple vote among their outputs to decide the final classification.

As one can see from the following plot, there is substantial variation in both the training and final test Modified Youdon measures for different models, with a degree of correlation between the training metric and the final test metric.  This led to the idea of constructing the final voting pool out of a subset of models with superior training performance.

In the end, the submission model consisted of a vote of the best 25 out of 135 candidate PNN models (by training score) constructed using 35 features. This yielded a training score of 0.794, and a final test score of 0.689.  Note that while some individual sub-models would have had very similar performance to the ensemble model, there was no obvious way of reliably identifying such high-performing sub models a priori, so the ensemble technique allowed for the combination of a number of good (and not so good) models into a better one.

I developed the model generation code in Visual Basic .NET, and did the final vote taking using a spreadsheet.  The generation and tuning of the 135 candidate models required nearly 8  hours on a single processor core of an Intel E5300.

— Ed Ramsden

Winner’s notes. Yuchun Tang on noise deduction to improve classification accuracy in SIAM SDM’11 Contest

By Yuchun Tang (piaopiao), the runner-up in SIAM SDM’11 Contest.

QSAR data provided for SIAM SDM’11 Contest were known to be highly noisy. Around 30% of labels provided could be wrong due to experimental uncertainty, as reported by the organizers after the contest was closed. Furthermore, this contest only counted the last submission, which means it was risky to overtune the models on the known data (including training data and preliminary test data).

In my approach, initially, a 7-fold cross validation strategy was adopted for modeling on the training data. Several classification algorithms were tried and the best CV results (in terms of Balanced Youden Index) were observed with R gbm and randomForest techniques. At that point the performance for gbm was 0.659/0.664/0.640 (in the order of 7-CV/preliminary/final), and for rf it was 0.636/0.718/0.628. (Of course, I only know the final performance after the contest is closed). I also tried different feature selection methods but I did not see obvious improvement so I decided Read more of this post

Winners’ notes. Frank Lemke on self-organizing high-dimensional QSAR modeling from noisy data

By Frank Lemke (pat) from KnowledgeMiner Software, 3rd in SIAM SDM’11 Contest.

Safety of pharmaceutical and chemical products with respect to human health and the environment has been a major concern for the public, regulatory bodies, and the industry, for a long time and this demand is increasing. Safety aspects start in the early design phases of drugs and chemical compounds and they end formally with the official authorization by national and international regulators. Traditionally, for decades, animal tests have been using as the preferred accepted tool – kind of Gold Standard, which, in fact, it is not – for testing harmful effects of chemicals on living species or the environment. Currently, in Europe only, about 10 million animals per year are (ab)used for laboratory experiments, and a lot of time and billions of Euros are spent into these experiments. So, we as consumers who use and value chemical products every day everywhere in some form are safe? No! Not really. About 90% of the chemicals on the market today have never been tested or have not been requested, officially, to be tested. There is a simple reason, apparently: Despite the ethical issues of animal testing – it is estimated that additional 10 – 50 million vertebrate animals would be required if all 150,000 registered substances would have to be tested in this traditional way – it is simply not possible to run animal tests for this amount of substances within reasonable time and cost constraints. Animal tests cannot do that. To solve this problem, there is a strong demand for alternative testing methods like QSAR models to help minimizing and widely substituting animal tests in the future.

Read more of this post

Winners’ notes. Robert Wieczorkowski, Yasser Tabandeh and Harris Papadopoulos on SIAM SDM’11 Contest

We have a pleasure to publish three after-challenge reports authored by participants of SIAM DM 2011 Data Mining Contest who achieved ex aequo the 4th best score (differed only in time of submission). Hope you’ll find the reports insightful. To view full results of the competition see Leaderboard and Summary pages.

* * *

By Robert Wieczorkowski, Ph.D. (rwieczor).

SIAM SDM’11 Contest was my second challenge in which I participated on TunedIT. Previously I took part in IEEE ICDM Contest (Traffic Prediction) and ended on the 12th place. Taking part in this challenge was for me a form of new practical experience in using modern statistical tools.

Read more of this post

%d bloggers like this: