June 11, 2010
An interesting post by Mike Loukides on the O’Reilly blogs: What is data science? The title question is hard to answer. Most likely there’s no single answer that everyone would agree on. Still, Mike makes a couple of good points and observations that are worth quoting:
The web is full of “data-driven apps.” Almost any e-commerce application is a data-driven application. (…) But merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product.
I would add that the web is not the only place full of data. The amount of data is growing exponentially in every domain, online and offline alike. But users keep moving from offline to web applications, and it’s easier and more natural to merge data from different users when things happen on the web than in an offline scenario. Some examples of offline applications: analysis of medical records, bioinformatics and genetics, video surveillance, energy demand forecasting, industrial control systems.
In the last few years, there has been an explosion in the amount of data that’s available. Whether we’re talking about web server logs, tweet streams, online transaction records, “citizen science,” data from sensors, government data, or some other source, the problem isn’t finding data, it’s figuring out what to do with it.
Yep. Data is king. I like the examples with CDDB and Google. It’s good to realize that 97% of Google’s revenue comes from advertising, and so ultimately from data mining algorithms: PageRank (smart search ranking) combined with AdSense and AdWords (intelligent online advertising). To put it differently, roughly $23 billion of Google’s revenue in 2009 came from data mining algorithms. It’s data mining and machine learning that make the Google search engine so accurate in answering queries and that attract so many users. It’s data mining and machine learning that allow Google to present advertisements at the optimal place and time, to the users who are potentially most interested in a given product.
At the same time, intelligent algorithms probably make up as little as 1% (or less) of their whole code base. Google has lots of other software that has nothing to do with data mining – various web apps (like Google Docs), libraries, widgets, APIs – but the core, the code that is critical to their revenue, the code that makes Google Google, is data mining!
This relation – 97% of revenue from 1% of the code base – is very typical for data mining applications. On the other hand, this 1% of code is very hard to invent, much harder than the other 99%. I wonder what share of Google’s costs the data mining algorithms account for, mainly for paying the specialists who devise them and then tune them thoroughly, step by step, over a long period of time. I would guess a number that’s closer to 99% than 1%.
The question every company is facing today (…) is how to use data effectively (…). Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.
Nothing to add.