There are no alternatives to data mining

During the 7th International Conference on Rough Sets and Current Trends in Computing (RSCTC 2010), we had the pleasure of talking with Vladimir Nikulin from the University of Queensland, Australia, the first-place winner of the Basic Track in the RSCTC’2010 Discovery Challenge: Mining DNA microarray data for medical diagnosis and treatment. Below you can read the transcript of our chat and learn what tips and tricks can be used to hack the data and achieve victory in a data mining competition. Obligatory reading for all practitioners and contestants.


Winners of RSCTC 2010 Discovery Challenge. From left: Guoyin Wang, Huan Luo, ChuanJiang Luo (Advanced Track), Vladimir Nikulin (Basic Track), Marcin Wojnarski (Chair)

Magdalena Pancewicz, TunedIT: How did it happen that data mining became your field of interest?

Vladimir Nikulin: I’ve got a PhD in Mathematical Statistics from the Moscow State University. I worked in Russia as a scientific programmer and then as an Associate Professor in Mathematics & Statistics. When I came to Australia in 1993, I had a very solid academic background, and it wasn’t difficult for me to study the most advanced methods of computational statistics and data mining (DM). I made the most progress in this area after joining the Australian National University in Canberra, where I worked for over 2 years, publishing papers and attending several international conferences. I also participated in data mining competitions, and won my first challenge in 2007: Agnostic Learning vs. Prior Knowledge Challenge.


Data mining curiosities: RSCTC 2010 write-up

Last week we had an excellent data mining conference in Warsaw – Rough Sets and Current Trends in Computing (RSCTC). Several months ago, TunedIT organized the Discovery Challenge for RSCTC: analysis of genetic data for medical purposes. Now there was a challenge session where the winners presented their solutions to the general public. Everyone was really curious how they did it, and many questions followed their talks, so they had no choice but to lift the curtain on their secret tricks. If anyone still wants to learn more, I recommend looking into the challenge paper – to be found here or in the conference proceedings (pp. 4-19). We’ll also post an interview with one of the winners shortly, so stay tuned!

Apart from the contest, the conference brought many interesting presentations. First of all, there were four invited keynote talks given by prominent researchers, professors: Roman Słowiński, Sankar Pal, Rakesh Agrawal and Katia Sycara.

Rakesh Agrawal is the head of Microsoft Search Labs, responsible for the development of Microsoft’s Bing search engine. In his talk, Search and Data: The Virtuous Cycle, he sketched what kinds of data mining problems they face when trying to make Bing more “intelligent”, so that search results contain exactly the pages that the user is looking for. It appears that one of the toughest problems is discovering the user’s real intentions: what are they really looking for? The search engine knows only the query string, usually very short (1-2 words, often misspelled), say “Ireland”, and must guess what the user expects: a travel guide for a tourist, or geographical facts about the country? Another problem is that many words have several different meanings: if the user writes “polish”, does it mean a verb, “to polish”, or an adjective, “Polish”? Yet another problem: how to deal with numbers in a smart way? The query “$200 camera” gives few sensible results if treated literally – better try “$199 camera” 🙂


Many more issues of this kind must be dealt with. Add that the algorithms must dig through petabytes of data in a matter of seconds, and you’ll have no doubt that the guys at Microsoft Search Labs never complain about boring assignments. BTW, I can confirm from my own experience that data size and performance requirements are critical factors in making data mining fun. With small data and no performance difficulties, data mining is just an interesting thing to do. When performance begins to play a role, you discover that 95% of your fantastic algorithms just can’t keep up and you’ve got to turn all the bright ideas (and software) upside down.

Another talk which I really enjoyed – Emergent Dynamics of Information Propagation in Large Networks – was delivered by Katia Sycara from Carnegie Mellon University. It’s interesting to observe how large networks of “agents”, for example people, share information among themselves on a peer-to-peer basis, such as through gossiping, and how the information fills the whole network at some point in time or – conversely – suddenly disappears. It’s important that we can predict the evolution of such processes, because in the real world the “information” being distributed may be an infectious disease whose spread should be stopped as soon as possible, or an operator’s request that must be distributed to all computers in a large decentralized network in the shortest possible time.

Which outcome is observed depends on different parameters of the network: how many connections there are between agents, what the topology is (uniform connections? separated clusters?), and how keen the agents are to pass the gossip on. But what’s most interesting is that …
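To get a feel for how these parameters interact, here is a minimal simulation sketch. It is not the model from Katia Sycara's talk: the ring-plus-shortcuts topology, the names `build_network` and `spread_gossip`, and the parameters `extra_links` and `pass_prob` are all assumptions made up for illustration. It also covers only the spreading side; agents never forget the gossip, so the "sudden disappearance" scenario is not modelled.

```python
import random


def build_network(n, extra_links, rng):
    """Build an undirected network: a ring (so it is connected)
    plus `extra_links` random shortcut connections. Needs n >= 3."""
    neighbors = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    for _ in range(extra_links):
        a, b = rng.sample(range(n), 2)  # two distinct agents
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def spread_gossip(neighbors, pass_prob, rounds, rng, start=0):
    """Return the fraction of agents that know the gossip after `rounds`.

    In each round, every informed agent passes the gossip to each
    uninformed neighbor independently with probability `pass_prob`.
    """
    informed = {start}
    for _ in range(rounds):
        newly_informed = set()
        for agent in informed:
            for other in neighbors[agent]:
                if other not in informed and rng.random() < pass_prob:
                    newly_informed.add(other)
        informed |= newly_informed
    return len(informed) / len(neighbors)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    for extra_links in (0, 50):
        for pass_prob in (0.1, 0.5):
            net = build_network(200, extra_links, rng)
            coverage = spread_gossip(net, pass_prob, rounds=10, rng=rng)
            print(f"shortcuts={extra_links:3d} pass_prob={pass_prob:.1f} "
                  f"coverage={coverage:.2f}")
```

Varying `extra_links` changes how connected the network is, and `pass_prob` stands in for how keen the agents are to gossip; comparing the printed coverage values shows how strongly the outcome can depend on both.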

Data mining curiosities: RSCTC 2010

Last week Warsaw hosted an excellent data mining conference, Rough Sets and Current Trends in Computing (RSCTC). Several months ago, TunedIT organized the Discovery Challenge as part of RSCTC: analysis of genetic data for medical purposes. Now a challenge session took place, at which the winners publicly presented their solutions. The audience listened attentively, and once the presentations were over the questions poured in, so the winners had no choice but to lift the veil on their secrets. If anyone would like to learn more about their solutions, I recommend looking into the challenge report or the conference proceedings (pp. 4-19). We will also soon post an interview with one of the winners, so follow our blog closely!

Besides the challenge session, the conference also brought many other interesting presentations. Above all, there were four open lectures given by world-class researchers, professors: Roman Słowiński, Sankar Pal, Rakesh Agrawal and Katia Sycara.

Rakesh Agrawal is the head of Microsoft Search Labs, responsible for the development of Microsoft’s Bing search engine. In his talk, …

What is data science?

An interesting post by Mike Loukides at O’Reilly blogs: What is data science? The title question is hard to answer. Most likely there’s no single answer that everyone would agree upon. But still, Mike makes a couple of good points and observations that are worth quoting:

The web is full of “data-driven apps.” Almost any e-commerce application is a data-driven application. (…) But merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product.

I would add that the web is not the only place full of data. The amount of data grows exponentially in every domain, be it on-line or off-line apps. But users are moving more and more from off-line to web applications, and it’s easier and more natural to merge data from different users when things happen on the web than in an off-line scenario. Some examples of off-line applications: analysis of medical records, bioinformatics & genetics, video surveillance, energy demand forecasting, industrial control systems.

In the last few years, there has been an explosion in the amount of data that’s available. Whether we’re talking about web server logs, tweet streams, online transaction records, “citizen science,” data from sensors, government data, or some other source, the problem isn’t finding data, it’s figuring out what to do with it.

Yep. Data is king. I like the examples with CDDB and Google. It’s good to realize that 97% of Google revenue actually comes from data mining algorithms: PageRank (smart search engine) combined with AdSense and AdWords (intelligent online advertising). To put it differently, $23 billion of Google revenue in 2009 came from data mining algorithms. It’s data mining and machine learning that make the Google search engine so accurate in answering queries and that attract so many users. It’s data mining and machine learning that allow Google to present digital advertisements at the optimal place and time, to users who are potentially most interested in a given product.

At the same time, intelligent algorithms make up as little as 1% (or less) of their whole code base. Google has lots of other software that has nothing in common with data mining – various web apps (like Google Docs), libraries, widgets, APIs – but the core, the critical code in terms of their revenue, the code that makes Google be Google, is data mining!

This relation – 97% of revenue from 1% of code base – is very typical for data mining applications. On the other hand, this 1% of code is very hard to invent, much harder than the other 99%. I wonder what share of Google’s costs data mining algorithms account for, mainly in paying the specialists who devise them and thoroughly, step by step, over a long period of time, tune them up. I would guess a number that’s closer to 99% than 1%.

The question every company is facing today (…) is how to use data effectively (…). Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.

Nothing to add.



The Spirit of TunedIT

Some time ago I spoke to a friend of mine, Paweł Szczęsny from the Polish Academy of Sciences – a biologist, a visionary of Open Science and a pioneer of scientific blogging. When I mentioned our plans to start a blog for TunedIT, Paweł, after giving it serious thought, came up with the following advice: “Remember one thing: do not write about yourself. If someone writes about oneself, the blog becomes terribly boring. Only if you keep writing about something different does it have a chance to be interesting”.


The Nerd Ghost

At first, this piece of advice seemed illogical to me – what’s the point in opening a blog related to the web portal TunedIT, if we are to write about something totally different? After all, this is so natural: if a new functionality comes up on TunedIT, we will mention it in the blog: “Today a new functionality has been released, which enables … it helps in … you can use it like this …” and so on. If there’s going to be a new competition: “Today we’re launching a new competition … the task is to …” Isn’t that the way you do it? Each of us could instantly list a long series of blogs where similar posts can be found. Don’t they sound so familiar, so natural, so conventional, so … banal? Hm, wait a moment. Banal? Actually, … Have I read many blogs like that? Sure! A lot! How many of them have I read further than the second sentence of a paragraph? I can’t recall any… So maybe writing about yourself is not the best choice for your blog, in fact? But if not, what else then makes it tick?

The Spirit of TunedIT

Some time ago I was talking with an acquaintance, Dr. Paweł Szczęsny from the Polish Academy of Sciences – a biologist, a visionary of Open Science and a pioneer of the scientific blogosphere. When I mentioned to him that I was planning to start a blog for TunedIT, Paweł, after a moment’s thought, gave the following advice: “Just remember one thing: don’t write about yourselves. If someone writes about themselves, the blog becomes terribly boring. Only if they write about something else does it have a chance of being interesting”.

At first this remark seemed illogical to me, because what is the point of starting a blog tied to the web portal TunedIT if you then have to write about something completely different? After all, this is so natural: …
