There are no alternatives to data mining
July 20, 2010
During the 7th International Conference on Rough Sets and Current Trends in Computing (RSCTC 2010), we had the pleasure of talking with Vladimir Nikulin from the University of Queensland, Australia, winner of first place in the Basic Track of the RSCTC’2010 Discovery Challenge: Mining DNA microarray data for medical diagnosis and treatment. Below you can read the transcript of our chat and learn what tips and tricks can be used to hack the data and achieve victory in a data mining competition. Obligatory reading for all practitioners and contestants.
Magdalena Pancewicz, TunedIT: How did it happen that data mining became your field of interest?
Vladimir Nikulin: I’ve got a PhD in Mathematical Statistics from Moscow State University. I worked in Russia as a scientific programmer and then as an Associate Professor in Mathematics & Statistics. When I came to Australia in 1993, I had a very solid academic background, and it wasn’t difficult for me to study the most advanced methods of computational statistics and data mining (DM). I made my major progress in this area after joining the Australian National University in Canberra, where I worked for over 2 years, publishing papers and attending several international conferences. I also participated in data mining competitions, and won my first challenge in 2007: Agnostic Learning vs. Prior Knowledge Challenge.
Theoretical statistics, which I studied originally, has very limited applicability compared to data mining, because of idealistic models and assumptions which are too general. I think that data mining and computational statistics [not to be confused with theoretical statistics – eds.] are among the fastest growing and most promising scientific areas, because high technologies are generating huge databases, which can’t be handled manually. That’s why I went into DM.
What is, in your opinion, the future of data mining?
There are no alternatives to data mining. This is not a theoretical area, so it is not about proving some sort of theorems. There are many datasets, there are many methods of data mining, and the most important thing is to find the right match: to use proper methods on a particular dataset, in a proper way.
How to do this?
Be flexible. There are many regulation parameters, and, I think, the key to the selection and evaluation of algorithms is cross-validation, or the leave-one-out method, which is applicable in the case of small sample sizes. Leave-one-out may be regarded as a perfect form of cross-validation. Today at the RSCTC conference I gave a talk about my methods. We use a very flexible framework based on cross-validation, where we are free to select the most suitable feature selection and classification methods. Using this framework we can make the right choice of models and their parameters. Besides, the models themselves are very flexible and depend on many regulation parameters.
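Editors' note: the idea of choosing a model by leave-one-out cross-validation can be illustrated with a minimal sketch. This is not Vladimir's actual framework. The synthetic data, the feature ranking by difference of class means, the nearest-centroid classifier and all function names below are our own assumptions, chosen only to keep the example short and self-contained. Note that the features are re-selected inside every fold, so the left-out sample never influences the selection.

```python
import random
from statistics import mean


def make_data(n_samples=30, n_features=50, n_informative=3, seed=0):
    """Synthetic stand-in for a small-sample, high-dimensional dataset."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 2.0 * label  # shift the informative features for class 1
        X.append(row)
        y.append(label)
    return X, y


def top_k_features(X, y, k):
    """Rank features by absolute difference of class means; keep the best k."""
    scores = []
    for j in range(len(X[0])):
        m0 = mean(row[j] for row, lab in zip(X, y) if lab == 0)
        m1 = mean(row[j] for row, lab in zip(X, y) if lab == 1)
        scores.append((abs(m1 - m0), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Predict the class whose centroid (on the chosen features) is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [mean(r[j] for r in rows) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy; features are re-selected inside every fold."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        features = top_k_features(X_train, y_train, k)
        pred = nearest_centroid_predict(X_train, y_train, X[i], features)
        correct += pred == y[i]
    return correct / len(X)


def select_k(X, y, candidates):
    """Return (best_k, scores), where scores maps each k to its LOO accuracy."""
    scores = {k: loo_accuracy(X, y, k) for k in candidates}
    best_k = max(candidates, key=lambda k: scores[k])
    return best_k, scores


if __name__ == "__main__":
    X, y = make_data()
    best_k, scores = select_k(X, y, [1, 3, 10, 50])
    print(best_k, scores)
```

Here the only "regulation parameter" being tuned is the number of selected features. In a real setting the same loop would range over feature selection methods, classifiers and their parameters.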
Running cross-validation requires a lot of computational time, so we use a powerful machine, a monstrous Linux workstation at the University of Queensland. In many cases this station works constantly, including overnight and on weekends. It is a multi-processor computer, which produces large amounts of numerical results. It’s necessary to maintain all the records very carefully. Basically, I record the results in a proper form, including many graphical illustrations, and based on these tables and graphs I do the analysis: which particular models are better, which particular parameter settings are better. Typically, I come to work in the morning and collect new results produced overnight, which give me new directions for where to go next. I give the computer new settings, and the cycle repeats again and again, until the deadline of the contest.
How long does it usually take to find a solution?
In some cases the solution itself may be very simple, but difficult to discover. In the case of data mining competitions time is very limited. If you consider, for example, last year’s ICDM competition, the task was very complex: clustering of three-dimensional curves. When you scan the brain, it is represented in computer memory as quite a big three-dimensional image, about 0.5 GB in size. It is necessary to group all these fibers into clusters, or bundles, as they call them. So it is a very difficult task but, again, why did I win this competition? I just found the right direction at one of the earliest steps, and followed this direction. Usually the difference between the first and the second best solution in data mining competitions may be 1-2%; 3-5% is quite significant. In the case of the latest ICDM 2009 competition the difference between the first and the second result was more than 50%!
So finding a solution in data mining is like choosing and following the right path?
Yes. For example, in the latest ACM KDD Cup competition – prediction of students’ performance in the area of mathematics – I wasn’t so lucky. The data were complex and required the right preprocessing. I made some mistakes at that stage and my results were far from good. During the last 24 hours of the competition I found some of these mistakes and made significant progress, but the time was over. I needed about 1-2 weeks more, but there was no deadline extension. The KDD Cup is an annual event, so we’ll see, maybe next time I’ll have more luck. On the other hand, there’s a direct correspondence between “luck” and experience.
And what problems did you face in RSCTC 2010 Challenge?
Bioinformatics is directly relevant to our research interests (the Statistics Group at UQ); it’s our major research direction. Just before this challenge I published several papers on classification of high-dimensional data and had three publications in the area of bioinformatics, so I was quite well prepared. It was a very interesting and important challenge.
Why did you decide to take part in this competition? Only because you were well prepared?
Yes, and, on the other hand, in order to be prepared I have to participate. Right now I am participating in some other competitions. I am sitting here, but I am participating. My computers are working and my PhD students are working as well.
And is there any difference between data in various competitions? What distinguishes biological data from, say, psychological (student behavior) data?
Of course, there are some differences, but they are not significant. Data mining is like mathematics: it is a general scientific area, and the methods of data mining are general methods.
What do you think about competitions? Are they a good way for scientists to carry out research? Are they a good way for companies to “crowdsource” efficient algorithms?
Yes. Data mining competitions represent a rapidly growing and very important part of computational statistics. Practically every large company in the world has a data mining department, responsible for data analysis and modeling. These activities may be expensive, but they’re unavoidable. And participation in data mining competitions may be useful for a wide range of researchers including academics, consultants and students, because practical experience is the best way to learn. Moreover, competitions represent a very important form of online (convincing) evaluation of different models and algorithms.
What books and journals would you recommend for PhD students or others interested in data mining?
The most famous book is Friedman’s Elements of Statistical Learning [T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2009]. It may be regarded as a handbook. Jerome Friedman and his co-authors are well-known and respected scientists from Stanford University. The book contains all the guidance necessary to develop initial skills in this particular area. And then… Google! With Google you’ll find whatever you like. There are plenty of papers on the Internet, available for free.
And what about software? Tools, frameworks, libraries.
There are very promising packages in R, like GBM. There is also ADA, although ADA in most cases is not as good as GBM. I would also recommend Random Forest… there are many methods. But in many cases you can’t rely on standard software. For example, in the previous ICDM competition, where the task was very specific and complicated, I didn’t use any standard software at all, only my own tools, written in C or MATLAB. I can also recommend Perl, which may be very handy for working with textual information.
Do you enjoy Poland and the RSCTC conference?
Poland is a very good and friendly country, and the RSCTC conference is very well organized, no problems at all. I’d like to congratulate the organisers on such an impressive event.
Do you find any topics particularly interesting?
Today there was a very interesting session. Andrzej Janusz, the winner of the Australian AusDM 2009 Analytic Challenge, made a presentation regarding his participation. I was a runner-up in that event, so it was very interesting for me. The session on classification of imbalanced data was also interesting.
You won the Basic Track of the RSCTC challenge. What do you think about the winning solution of the Advanced Track?
We produced very impressive results (according to the Leaderboard) with the multilayer perceptron, but we relied on this model too much. Probably, it would have been better to make an alternative submission with Support Vector Machines. That’s what the Chinese team, winners of the Advanced Track, did. In most cases it’s a good idea to consider some alternatives.
Apart from the competition, do you work with bioinformatics data?
Professor Geoff McLachlan, with whom I cooperate, has a joint academic appointment at the University of Queensland and at the Institute for Molecular Bioscience. He works with real-world bioinformatics data, and I also participate in this research work. We may expect that our internationally recognized results will help us increase, or at least maintain, our current funding opportunities from the Australian Research Council.
Regarding research and competitions, is it easier to work on your own or in a team?
It depends on what sort of team. A team means diversity of skills – this is the key element of teamwork. If everybody had the same sort of skills as me, what sense would it make to work in such a team? I worked on the Advanced Track in a team with Tian-Hsiang Huang, a PhD student from Taiwan, and it was great. What was difficult for me was easy for Tian-Hsiang. He’s very skillful in Java programming. Also, he developed my website. In some cases, he doesn’t understand the principles of statistics and mathematics very well – but this is easy for me. A team may be great.
Do you intend to take part in the latest competition organized by TunedIT – the ICDM Contest related to road traffic prediction?
Yes, we’re interested. We’ll consider this task very seriously. However, time series analysis is not my major strength, and there are also some problems with free time. But we’ll do our best.
You’ve mentioned Russia. Where exactly do you come from?
Kirov is my native city, 600 km east of Moscow. It is a nice regional city, and there is still a good environment. Frankly, it’s too exhausting for me to live in such a huge city as Moscow.
And do you like the place you live in, Australia?
I have some professional and financial interests there, but without such interests I wouldn’t stay in Australia.
I have a workplace in Kirov at the Vyatka State University. I have accommodation there and colleagues who consider me a staff member. In Russia, it’s a very interesting time now. Russia is dynamic and vibrant. Every year I spend holidays there.
You’re also taking part in many conferences around the world. Which place among those you’ve visited is your favorite one?
Washington is very interesting, but very expensive too. There are plenty of museums near the administrative centre and the White House. For example, the Air and Space and Natural History museums were very impressive. In Genoa, Italy, where I was in October last year, I enjoyed the mild Mediterranean climate. In Hong Kong, quite the contrary, the climate was terrible. I was there at about this time of year, the end of June. I like the weather here in Poland; it’s warm and sunny. Now I’m looking forward to visiting Barcelona in about one week’s time – I’ll have a presentation there, at the WCCI 2010, about my solution for the ICDM 2009 Contest (you may read the paper).
What do you do in your free time?
I like cycling very much. There are good facilities for cycling in Brisbane. Some sporting activity is necessary to keep yourself in good shape. Also, I like reading, but, frankly, I have less and less time to do that.
Thank you for the interview.
Details of Vladimir’s algorithm can be found in the report.