There are no alternatives to data mining

During the 7th International Conference on Rough Sets and Current Trends in Computing (RSCTC 2010), we had the pleasure of talking with Vladimir Nikulin from the University of Queensland, Australia, the first-place winner of the Basic Track in the RSCTC’2010 Discovery Challenge: Mining DNA microarray data for medical diagnosis and treatment. Below you can read the transcript of our chat and learn what tips and tricks can be used to hack the data and achieve victory in a data mining competition. Required reading for all practitioners and contestants.


Winners of RSCTC 2010 Discovery Challenge. From left: Guoyin Wang, Huan Luo, ChuanJiang Luo (Advanced Track), Vladimir Nikulin (Basic Track), Marcin Wojnarski (Chair)

Magdalena Pancewicz, TunedIT: How did it happen that data mining became your field of interest?

Vladimir Nikulin: I’ve got a PhD in Mathematical Statistics from the Moscow State University. I worked in Russia as a scientific programmer and then as an Associate Professor in Mathematics & Statistics. When I came to Australia in 1993, I had a very solid academic background, and it wasn’t difficult for me to study the most advanced methods of computational statistics and data mining (DM). The major progress in this area I made after joining the Australian National University in Canberra, where I worked for over 2 years, publishing papers and attending several international conferences. Also, I participated in data mining competitions, and won my first challenge in 2007: Agnostic Learning vs. Prior Knowledge Challenge.


Data mining curiosities: RSCTC 2010 write-up

Last week we had an excellent data mining conference in Warsaw – Rough Sets and Current Trends in Computing (RSCTC). Several months ago, TunedIT organized the Discovery Challenge for RSCTC: analysis of genetic data for medical purposes. Now there was a challenge session where the winners presented their solutions to the general public. Everyone was really curious how they did it, and many questions followed their talks, so they had no choice but to lift the curtain on their secret tricks. If anyone still wants to learn more, I recommend looking into the challenge paper – to be found here or in the conference proceedings (pp. 4-19). We’ll also post an interview with one of the winners shortly, so stay tuned!

Apart from the contest, the conference brought many interesting presentations. First of all, there were four invited keynote talks given by prominent researchers, Professors Roman Słowiński, Sankar Pal, Rakesh Agrawal and Katia Sycara.

Rakesh Agrawal is the head of Microsoft Search Labs, responsible for the development of Microsoft’s Bing search engine. In his talk, Search and Data: The Virtuous Cycle, he sketched the kinds of data mining problems they face when trying to make Bing more “intelligent”, so that search results contain exactly the pages that the user is looking for. It appears that one of the toughest problems is to discover the real intentions of the user: what are they really looking for? The search engine knows only the query string, usually very short (1-2 words, often misspelled), say “Ireland”, and must guess what the user expects: a travel guide for a tourist or geographical facts about the country? Another problem is that many words have several different meanings: if the user writes “polish”, does it mean the verb “to polish” or the adjective “Polish”? Yet another problem: how to deal with numbers in a smart way? The query “$200 camera” gives few sensible results if treated literally – better try “$199 camera” :-)
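To make the “$200 camera” problem concrete, here is a minimal sketch of treating a price in a query as an approximate target rather than a literal string. It is only an illustration, not how Bing works: the function names, the tiny product list and the 10% tolerance are all assumptions made up for this example.

```python
import re

# Matches a dollar amount such as "$200" or "$199.99" (illustrative pattern).
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d+)?)")


def parse_price_query(query):
    """Split a query into (target price or None, remaining keywords)."""
    match = PRICE_PATTERN.search(query)
    if match is None:
        return None, query.split()
    price = float(match.group(1))
    # Remove the price token and keep the rest as plain keywords.
    keywords = (query[:match.start()] + query[match.end():]).split()
    return price, keywords


def search(products, query, tolerance=0.10):
    """Return products matching all keywords whose price is within
    +/- tolerance (a fraction, assumed 10% here) of the queried price."""
    price, keywords = parse_price_query(query)
    hits = []
    for name, product_price in products:
        if not all(k.lower() in name.lower() for k in keywords):
            continue
        if price is not None and abs(product_price - price) > tolerance * price:
            continue
        hits.append((name, product_price))
    return hits


if __name__ == "__main__":
    # Hypothetical catalogue, invented for this example.
    catalogue = [
        ("Compact camera", 199.0),
        ("Pro camera", 899.0),
        ("Camera bag", 25.0),
    ]
    print(search(catalogue, "$200 camera"))
```

With this reading, a query for “$200 camera” also finds the camera priced at $199, which a literal string match would miss.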

Rakesh Agrawal at RSCTC

Many more issues of this kind must be dealt with. Add that the algorithms must dig through petabytes of data in a matter of seconds, and you’ll have no doubt that the guys at Microsoft Search Labs never complain about boring assignments. BTW, I can confirm from my own experience that data size and performance requirements are critical factors in making data mining fun. With small data and no performance difficulties, data mining is just an interesting thing to do. When performance begins to play a role, you discover that 95% of your fantastic algorithms just can’t keep up and you’ve got to turn all the bright ideas (and software) upside down.

Katia Sycara at RSCTC

Another talk which I really enjoyed – Emergent Dynamics of Information Propagation in Large Networks – was delivered by Katia Sycara from Carnegie Mellon University. It’s interesting to observe how large networks of “agents”, for example people, share information among themselves on a peer-to-peer basis, such as through gossiping, and how the information fills the whole network at some point in time or – conversely – suddenly disappears. It’s important that we can predict the evolution of such processes, because in the real world the “information” being distributed may be an infectious disease whose spread should be stopped as soon as possible, or an operator’s request that must be distributed to all computers in a large decentralized network in the shortest possible time.
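For readers who want to play with this kind of process, here is a toy simulation of gossip spreading through a network. It is a minimal sketch under my own assumptions, not the model from the talk: the random network, the rule that each newly informed agent tells each neighbour at most once, and the fixed `pass_probability` are all invented for illustration.

```python
import random


def build_random_network(n_agents, n_links_per_agent, rng):
    """Build an undirected random network: every agent links to
    n_links_per_agent randomly chosen peers (assumed topology)."""
    neighbours = {a: set() for a in range(n_agents)}
    for a in range(n_agents):
        others = [b for b in range(n_agents) if b != a]
        for b in rng.sample(others, n_links_per_agent):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def spread_gossip(neighbours, pass_probability, rng, start=0):
    """Spread one piece of gossip from `start`. Each newly informed agent
    tries once to tell each neighbour, succeeding with pass_probability.
    Returns the fraction of agents that end up informed."""
    informed = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        for agent in frontier:
            for peer in sorted(neighbours[agent]):
                if peer not in informed and rng.random() < pass_probability:
                    informed.add(peer)
                    next_frontier.append(peer)
        frontier = next_frontier
    return len(informed) / len(neighbours)


if __name__ == "__main__":
    rng = random.Random(42)
    network = build_random_network(n_agents=500, n_links_per_agent=3, rng=rng)
    for p in (0.05, 0.2, 0.5, 0.9):
        reached = spread_gossip(network, p, rng)
        print(f"pass_probability={p:.2f} -> fraction informed={reached:.2f}")
```

Running it with different values of `pass_probability` and `n_links_per_agent` lets you see for yourself when the gossip dies out early and when it reaches most of the network.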

Which outcome is observed depends on several parameters of the network: how many connections there are between agents, what the topology is (uniform connections? separated clusters?), and how keen the agents are to pass the gossip on. But what’s most interesting is that …

Data mining curiosities: RSCTC 2010

Last week Warsaw hosted an excellent data mining conference, Rough Sets and Current Trends in Computing (RSCTC). Several months ago, TunedIT organized the Discovery Challenge as part of RSCTC: analysis of genetic data for medical purposes. Now there was a challenge session at which the winners presented their solutions in public. The audience listened attentively, and once the presentations were over the questions poured in, so the winners had no choice but to lift the veil on their secrets. If anyone would like to learn more about their solutions, I recommend looking into the challenge report or the conference proceedings (pp. 4-19). We will also soon publish an interview with one of the winners, so follow our blog closely!

Apart from the challenge session, the conference also brought many other interesting presentations. First of all, there were four open lectures given by world-class researchers, Professors Roman Słowiński, Sankar Pal, Rakesh Agrawal and Katia Sycara.

Rakesh Agrawal is the head of Microsoft Search Labs, which is responsible for developing the engine behind Microsoft’s Bing search. In his presentation, …
