Thursday, February 10, 2011

Statistics, data mining, machine learning, and culture

"You'll realize that you need a dictionary to go between measure theory and probability theory. The underlying concept is the same, but the terminologies are different." - Kathryn Hare, PM354 Measure Theory

When I started learning about statistical techniques, I knew the textbook definitions of statistics and data mining. The more I worked in these areas, the less clear the distinctions became. Then I met people who loved machine learning but didn't know statistics, and statisticians who had never heard of machine learning.

After asking people and reading around, I got a partial answer as to how these fields differ. In short, doing statistics is like asking a multiple choice question, mostly with two choices (i.e. is this true, or not?), with more emphasis on using data points "efficiently", since getting data from experiments is expensive. Data mining is more like doing exploratory analysis on a big data set, usually collected for other purposes, with no guarantee of any results. Machine learning deals with automating decisions to optimize something in real time, so there is a focus on iterative methods and on-line algorithms that can generate better predictions over time.
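To make the contrast a bit more concrete, here is a rough Python sketch. It is only an illustration under assumptions of my own choosing: the coin-flip setting, the function name `binomial_two_sided_p`, and the class name `OnlineMean` are all hypothetical and do not come from any particular library. The first part asks a two-choice question ("is this coin fair, or not?") from a small, fixed batch of data. The second updates a prediction one observation at a time, in the spirit of an on-line algorithm.

```python
from math import comb


def binomial_two_sided_p(heads, flips, p=0.5):
    """Exact two-sided p-value for observing `heads` in `flips` tosses.

    Sums the probability of every outcome that is no more likely than
    the observed one, under the hypothesis that P(heads) = p.
    """
    pmf = [comb(flips, k) * p**k * (1 - p) ** (flips - k)
           for k in range(flips + 1)]
    observed = pmf[heads]
    # Small relative tolerance so ties are not lost to rounding.
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))


class OnlineMean:
    """Running-mean predictor that is updated one data point at a time."""

    def __init__(self):
        self.n = 0
        self.estimate = 0.0

    def update(self, x):
        # Incremental update: no need to store past observations.
        self.n += 1
        self.estimate += (x - self.estimate) / self.n
        return self.estimate


if __name__ == "__main__":
    # "Statistics" flavour: one yes/no question from a small batch.
    print(binomial_two_sided_p(heads=10, flips=10))

    # "Machine learning" flavour: the prediction is refined as data streams in.
    model = OnlineMean()
    for x in [1.0, 2.0, 3.0]:
        print(model.update(x))
```

The running mean is about the simplest on-line method there is. Real systems would use something richer, but the shape is the same: see a data point, adjust, repeat.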

Overall, though, the techniques used in each of these areas are largely the same. The important difference between them lies in the culture of the people using the techniques.

For example, statisticians are a very different breed of people compared to people in data mining and machine learning. When I think of the word "statistician", I still somehow think of an old, bald man in his PhD suit reviewing papers and writing reports for his consulting work. "Machine learning", on the other hand, has quite a different feel to it. I think of hackers, people who just want to get something cool working -- a book recommendation, a way to predict which ads you click on -- and whose method of "reporting" primarily consists of shouting across the room. "Data mining" seems to fall somewhere in between, but I haven't met enough data miners to be sure.

No, I don't think all statisticians are old bald men (in fact my mom is a statistician, and she's neither old, bald, nor a man). I think there are very cool statisticians out there who do really interesting and useful research (e.g. mom). I do think that each field tends to have its own distinct culture, just as each company, school, or any non-random congregation of people would.

The culture of people in different fields is pretty important to think about when we decide what to do with our lives. There are many, many interesting fields out there, and choosing one that is an epsilon "more interesting" than the others is not as fruitful as understanding the culture of the people in these fields: how they work, how they collaborate, and what they are generally like.
