Friday, December 24, 2010

"What's on your mind?"

Click here to see my post on the Facebook Data Team blog.

Pressing "Publish" on that thing when no one else was around (and with the moral support of other interns) was a great way to end my internship.

Oh and yes, I'm back in Toronto now.

End of Entry

Tuesday, December 21, 2010

Things I learned

Cameron, the Data Science manager at Facebook, dropped an innocuous-sounding question at the intern goodbye dinner last Friday: "What was the most interesting/important thing you learned this term, not necessarily at Facebook?"

I would have responded with something witty, except I'm obviously too slow for that. Thinking back though, enough had happened this term for me to give a non-idiotic answer. In fact, I can list three pretty important things that I learned this term. Here they are, in no particular order:

1) Famous people are famous because they do things. There's nothing more to it, and nothing less.

Being intimidated by people who are famous is something I still haven't gotten over, but between accidentally cutting in front of Zuck in the dish line, asking Donald Knuth a stupid question in his Christmas tree lecture*, and seeing my friend Paul Butler become famous, I realized that their ticket to fame is actually really simple: when they had an idea, they followed through.

Simply put, they did things. They executed.

There really is nothing more to it. I'm sure more of my friends will decide to do things, get noticed, and as a side effect, become famous. There's nothing intimidating or far-fetched about that.

2) I don't know much about statistics.

I was learning statistics in a pretty non-standard (a.k.a. "hands-on") way, which was basically (a) try to solve problems with what I know, (b) fail, (c) read, (d) fail a little bit less. While this taught me a LOT, the process was best at teaching me exactly what I don't know. No, I don't mean the cliché and unhelpful "I learned that there's so much more to this subject!" bullshit. The process helped me build a concrete to-do list of what subjects to research, what books to read, and what fun projects to attempt. For now though, I'll survive by knowing that logistic regression is the answer to 99% of questions in statistics.

3) When you decide to do things, opportunities come.

Our world is really a land of opportunity. Especially at Facebook, the difference between saying "yes" to something and saying "no" is astronomical. Saying "yes" or just doing interesting things seems like such a simple thing to do, and those who did it (e.g. Paul, Gurrinder, etc.) got great results. I'll be honest though: I haven't been as keen on saying "yes" this term as compared to last term, and it really showed. We worry so much more about macro-decisions like where to go to school and where to do the next internship, so it's funny to notice that micro-decisions such as "should I do this today or next week?" can be just as life-changing.

Well, for better or for worse, that was my term. I did some things I'm proud of and a few that I'm not. I failed a lot, but learned a lot too. Hopefully, in 2011, I won't let trivial fears set me back: I'll do more, try more, and say "yes" more. There's just too much to lose otherwise.

And yes, Facebook was awesome.

End of Entry

*me: "You used n in two different ways!" Knuth: "... to show an equality."

Saturday, October 16, 2010

A better visualization of job posting length vs application

Here's a plot that should have been included in part three of Mining Jobmine. It describes how the distribution of applications changes as the length of the job posting changes. For each x-value representing a certain length of job posting, the plot gives an estimate of the proportion of jobs that have 0 to 22 applications, 23 to 40 applications, 41 to 70 applications, and 71 or more applications. The end points are chosen to be the 25th, 50th and 75th percentiles of the number of applications.
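As a rough illustration of the binning described above, here is a minimal Python sketch. The cut points 22, 40 and 70 are the quartiles quoted in this post; the helper names and the sample counts are made up for illustration and are not the original analysis code.

```python
from bisect import bisect_left
from statistics import quantiles

# Quartile cut points quoted in the post (25th, 50th, 75th percentiles).
POST_CUTS = [22, 40, 70]
LABELS = ["0 to 22", "23 to 40", "41 to 70", "71 or more"]

def category(applications, cuts=POST_CUTS):
    """Return the label of the bin that a posting's application count falls into."""
    # bisect_left puts a count equal to a cut point in the lower bin (22 -> "0 to 22").
    return LABELS[bisect_left(cuts, applications)]

def cuts_from_data(application_counts):
    """Compute three quartile cut points from raw counts (hypothetical helper)."""
    return quantiles(application_counts, n=4)

# Made-up counts, only to show the calls.
sample = [3, 10, 22, 23, 35, 40, 41, 55, 70, 71, 120]
print([category(a) for a in sample])
print(cuts_from_data(sample))
```

Counting how many postings of a given length land in each category, and dividing by the total at that length, would give the proportions that the plot shows.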

The size of the "0 to 22 applications" category increases steadily as the length of job posting increases from 30-ish to around 500 words, indicating a drop in applications. But as the length of job posting increases beyond 500 words, the size of the bottom-most category decreases. This decrease is offset by an increase in the size of the ">71" category. My guess is that jobs with really long job postings are ones where multiple positions are advertised (e.g. Google job posting...).

Compared to what I had before, this is a much better way of visualizing the correlation between length of job posting and applications.

End of Entry

Trains and German Tanks: a Probability Problem

This problem had bugged me for quite a while, and since many people had contributed to solving it, I thought I should write it up. It's a problem that first came up in an introductory probability course, and was used to teach us the concept of maximum likelihood. Try it yourself before reading the solutions if you like probability puzzles...

Problem Statement

You're standing outside by a railroad. A train passes by, and you realize that it is train #m. You know that the company that operates the trains (Viarail, CN, or whatever) numbers its trains sequentially, so that there is a train #1, #2, #3, all the way up to train #n. Assume that each of the n trains is equally likely to pass by where you stand. Now, given m, the train number that you just saw, estimate n, the total number of trains owned by the company.

Some Comments

You don't get much information from this problem: the only thing you get is the value of m, and you have to guess n. The ambiguity comes from not knowing how our guesses will be judged: should we maximize the probability of guessing n correctly? Should we minimize squared error? Or perhaps we should minimize absolute variance. Each of these different ways of measuring how good our guess is will lead to a different solution.

Solution #1: Maximizing the Probability of Being Right

This was the way my professor interpreted the problem. To maximize our chances of guessing n correctly, we would use maximum likelihood. We choose a guess for n so that P(M=m|N=n) is maximal. Note n has to be at least as big as m, and P(M=m|N=n)=1/n for n>=m, which gets smaller as n increases. So we would pick the value m to be our estimate for n.

This is an unintuitive answer. Again, it's because this solution is only optimal if you only care about guessing correctly, with no credit given for getting close to the real value of n. Run a simulation, and you will see that guessing n=m will indeed maximize your chances of being correct.
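Here is one way such a simulation might look, as a minimal Python sketch. It assumes n is drawn uniformly from 1 to 100 (the upper limit, the trial count, the seed and the function name are all arbitrary choices for illustration), then compares how often the guesses m, m+1, m+2 and m+5 hit n exactly.

```python
import random

def hit_rates(offsets=(0, 1, 2, 5), n_max=100, trials=100_000, seed=0):
    """Fraction of trials in which the guess m + offset equals n exactly."""
    rng = random.Random(seed)
    hits = {d: 0 for d in offsets}
    for _ in range(trials):
        n = rng.randint(1, n_max)   # hidden number of trains (assumed uniform)
        m = rng.randint(1, n)       # number on the train that passes by
        for d in offsets:
            if m + d == n:
                hits[d] += 1
    return {d: h / trials for d, h in hits.items()}

rates = hit_rates()
for d, r in sorted(rates.items()):
    print(f"guess n = m + {d}: exactly right {r:.2%} of the time")
```

Under these assumptions the guess n = m should come out on top, even though it is almost always too low.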

Solution #2: Minimizing Squared Error

After some discussions with Greg, I did a simulation where n is drawn from a uniform distribution from 1 to something large, then m is drawn from a uniform distribution from 1 to n. Then we estimated n using km for various values of k, to see which values of k gave the largest number of correct guesses (n=km), and also the smallest squared error. As mentioned, setting k=1 (our maximum likelihood estimate) gave the largest number of correct guesses, but a different value of k gave the smallest squared error: k=1.5.

This is a strange value, and it confused me for the longest time. Fortunately, William was able to derive this value using math. We want to minimize E((n-kM)^2) where M~U(1,n), noting that E(M)=(n+1)/2 and E(M^2) = Var(M) + E(M)^2. Expanding E((n-kM)^2), setting its derivative with respect to k to zero, then solving for k gives k = (3n^2+3n)/(2n^2+3n+1), which is approximately 1.5 for large enough n.
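The formula can be double-checked without a simulation. The sketch below (function names are mine, not William's) computes E(M) and E(M^2) by brute force for M uniform on 1..n, solves the first-order condition n E(M) = k E(M^2) exactly with fractions, and compares the result with the closed form above, which also simplifies to 3n/(2n+1).

```python
from fractions import Fraction

def best_k(n):
    """k minimizing E((n - kM)^2) for M uniform on 1..n, by brute force."""
    e_m = Fraction(sum(range(1, n + 1)), n)                   # E(M)
    e_m2 = Fraction(sum(m * m for m in range(1, n + 1)), n)   # E(M^2)
    # d/dk [n^2 - 2kn E(M) + k^2 E(M^2)] = 0  =>  k = n E(M) / E(M^2)
    return n * e_m / e_m2

def closed_form(n):
    """The expression quoted in the post."""
    return Fraction(3 * n * n + 3 * n, 2 * n * n + 3 * n + 1)

for n in (1, 2, 10, 100, 1000):
    print(n, best_k(n), float(best_k(n)))
```

The values creep up towards 1.5 from below as n grows, which matches what the simulation found.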

Solution #3: Minimizing Absolute Variance

This is relatively simple: set E(n-kM)=0, that is, ask that the estimate kM be neither too high nor too low on average. This gives k = 2n/(n+1), so k should be approximately 2. Amusingly enough, Paul and Kevin pointed out that a generalization of this problem actually came up in real life. Apparently Germans in WWII numbered their tanks sequentially, and so the Western allies were able to use statistical techniques to estimate the number of tanks they had. Look up the "German tank problem" if you're interested.
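To see the three answers side by side, here is a small Python sketch that computes the average error and the average squared error of the estimate km for k = 1, 1.5 and 2. Instead of sampling, it enumerates every (n, m) pair exactly, assuming n is uniform on 1 to 100 (that upper limit and the function name are my choices for illustration).

```python
from fractions import Fraction

def exact_bias_and_mse(k, n_max=100):
    """Exact E(n - kM) and E((n - kM)^2), n uniform on 1..n_max, M uniform on 1..n."""
    k = Fraction(k)
    bias = Fraction(0)
    mse = Fraction(0)
    for n in range(1, n_max + 1):
        weight = Fraction(1, n_max * n)   # P(N = n) * P(M = m | N = n)
        for m in range(1, n + 1):
            err = n - k * m
            bias += weight * err
            mse += weight * err * err
    return bias, mse

for k in ("1", "3/2", "2"):
    bias, mse = exact_bias_and_mse(k)
    print(f"k = {k}: mean error {float(bias):.2f}, mean squared error {float(mse):.1f}")
```

Under these assumptions k = 2 has by far the smallest average error, while k = 1.5 has the smallest squared error of the three, so which k is "best" really does depend on how the guess is judged.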


The optimal estimate of a variable depends on how you penalize errors (i.e. what statisticians call the loss function). Numbering things sequentially can be really, really dumb. Having nerdy friends who can solve your problems is awesome.

End of Entry

Friday, September 17, 2010

A note on believing

I would never die for my beliefs because I might be wrong. -- Bertrand Russell
The thing that confuses me the most about most religions is the idea of "faith", the idea that one should believe in something without having (enough) evidence for it. Even more puzzling are those who try to convince others of their God by telling them to "believe". If you step back for a moment, it becomes quite humorous, actually. Just imagine:
Person A: Believe in my God!
Person B: Ok.
What I'm really confused about is this: what does it actually mean for Person B to say "Ok, I've decided to believe in God"? Or rather -- does it actually mean anything? Do we really get to choose what sounds reasonable in our minds? Isn't what we end up believing based on the evidence that we have (i.e. our experiences) and how we interpret it? Thus to change our opinion on something, we could (a) bring out new evidence that we haven't considered, or (b) refine our methods of reasoning about that evidence.

I'm really curious about this -- do you think you can decide what you believe to be true? I don't think I can. Sure I can lie to myself, and I'm perfectly capable of pretending to believe in something. I can want to believe something, and notice myself biasing the evidence towards it. But deep down, I'd still know that it's wrong, that perhaps there's not (yet) enough evidence. This knowing is much stronger than the pretend-belief, and my hypothesis is that we all have this compass inside of us that knows how plausible things actually are.

Of course, I might be wrong.

End of Entry

Monday, August 9, 2010

a little game

About a week ago, I took a walk and found myself in front of a maple tree. I said hello, and asked for permission to come closer. Feeling no resistance, I stepped under it. Like other maple trees in August, it was full of those seeds that spin as they fall. I watched as a few made their way to the ground. Just for fun, I asked the tree to drop one of its seeds to me, so I could catch it. The tree didn't seem to mind: one seed fell close to me, but it wasn't close enough to be within reach. I asked to try again, because I knew I'd catch it this time, I promise! The tree dropped another seed, this time so close that I really should have caught it -- alas it went right between my fingers. A moment of disappointment later I felt something cold on my stomach: the seed had fallen against the rim of my pants. I did catch it!

Moments like this leave me chuckling at my ignorance of the nature of our existence. These moments don't always prove anything, but even when they do, you have to be the one experiencing them to really believe it. Is there any point in us sharing them?

Oh and FYI, there was a squirrel on that tree. It made weird chipping noises and flicked its tail up and down. I'm not sure what that meant, so I stepped back for a while until it stopped. It was one strange squirrel.

End of Entry

Thursday, July 29, 2010

Mining Jobmine: Part 3. From the Employer's Perspective

Recently, Paul asked the question of what would make his resume more effective. I now ask a very similar question from the employer's point of view: What can employers do to make job postings more effective? While A/B testing job postings is not an option for me, it is possible to look at Jobmine data to find attributes of job postings that correlate with the number of applications.

Keep job postings short

There is a negative correlation between the number of words in a job posting and application rate. This correlation is very small, but still statistically significant. Below is a smoothed scatter plot of words per job posting vs. applications (darker colours mean a denser packing of points), with a curve of best fit [1].

The curve implies a loss of about one application for every 50-60 words added. Again, the decrease is slight, and the length of a job description explains very little of the variation amongst application rates. This is not surprising: many factors affect application rate of a job, such as the actual job, and we expect the effect of the length of a job description to be minor compared to more important factors.

To uncover other subtle factors affecting application rate, I tried a technique I learned at Facebook: for each job posting, I calculated the percentage of words used in each of the approximately 100 word categories in Harvard’s General Inquirer dictionary (e.g. percentage of positive words, food-related words, law-related words, etc). While this method did not yield as much insight as I had hoped [2], there was one interesting observation...

Talk about the company, not the candidate

There is a negative correlation between “you” pronouns (“you”, “your”, etc) and application, and a weaker positive correlation between “our” pronouns (“we”, “our”, etc) and application. This makes some sense: perhaps students enjoy reading about what a potential employer is like, rather than about what they must do or be. Perhaps seeing someone say that "you should have a solid knowledge of spreadsheet applications" is taken to be a bit aggressive. Incidentally, there is a negative correlation between “ought” words (“must”, “should”, etc) and application.

The word “you” came back again when I analyzed the correlations between application rates and the appearance [3] or increased use [4] of individual words (as opposed to word groups). Indeed there is a negative correlation between repeated use of the word “you” and application.

Good words, bad words

Several other words are correlated with application rates. Here are some words whose appearance or increased use is positively correlated with application rates:
Analysis, Capital, Construction, Design, Electrical, Energy, Engineers, Engineering, Excel, Mechanical, Projects, Toronto
Many of these words relate to the previous parts of “Mining Jobmine”, as they identify fields in low supply or high demand (which are apparently finance and engineering, especially mechanical engineering), and places that Waterloo students want to be (well, Toronto...). I’m not sure how to interpret the word “projects”.

As for words whose appearances are negatively correlated with application rates [5], there are actually more of these than "positive" words. Below is a partial list consisting of the most statistically significant words.
Application, Community, Development, Framework, fulltime, hours, HTML, Java, need, .NET, open, planning, Server, SQL, title, Unix, users, Web, Windows, within, XML
Again, the programming words in this list suggest that programming jobs are in low demand or high supply. Other words are hard to interpret: should employers refrain from talking about their hours, their fulltime employees, or their users’ needs? Perhaps some of these correlations are spurious.

Junior, Intermediate, AND Senior

Each job posting on Jobmine has one or more “level” tags associated with it: Junior, Intermediate, and Senior. These tags describe the “level” of students that an employer seeks, and are used by students to search for jobs appropriate to their level. The plot below shows the mean application rates (and 95% confidence interval) of jobs with each set of tags, with the red line showing the mean application over all jobs.

In most cases, adding an extra “level” tag increases application rates by about 10. Adding an extra “level” tag would mean that more students are likely to see your job. The exceptions are, of course, those 7 jobs that are tagged Junior and Senior...

Avoid special instructions

Special instructions are red-coloured messages that appear above a job description in Jobmine. Employers use them to announce information sessions, to remind students to apply through their website, or for other reasons. Around 40% of job postings on Jobmine have special instructions, and these postings receive 6 fewer applications on average than postings without special instructions. This is quite a large difference - and statistically significant, too. Perhaps the contents of special messages turn applicants away? Perhaps people don’t like seeing big bright red messages when reading a job posting? Either way, including special instructions might have drawbacks that employers do not expect.


While most students spend hours perfecting their resumes, employers don’t always think as much about job descriptions. Yet these analyses show that a student’s decision to apply for a job can be influenced by factors other than the job itself. Some of these influences are marginal, while others are large. The analyses suggest that employers can increase the candidate pool by shortening job postings, rewording job descriptions, or by being cautious about using special instructions. Of course, an employer’s end goal is to find a suitable candidate, and so the quality of the candidate pool is more important than its size. Whether or not improving a job description is worth an employer’s time is another story -- especially since the effects of changing an individual job posting are uncertain.


[1] Application numbers are heavily skewed, so to satisfy the assumptions of the linear regression model we take the square root of application rate as our dependent variable. Number of words in a job posting is still our independent variable, and the curve we get is a quadratic.
[2] Several word categories showed statistically significant correlations with applications, but these correlations are hard to interpret because many word categories are filled with homonyms and questionable words. For example, the category “Land” contains words describing places occurring in nature, and is correlated with applications. However top words contained in this category are “field”, “range”, “bank” and “fall”. As another example, the words “time”, “service” and “fun” are considered “hostile” words in General Inquirer.
[3] To test the effect of the appearance of a word, I split up the jobs based on whether or not a particular word appeared in the job description, and used an unpaired two-sample t-test. Very uncommon words or very common words were ignored.
[4] To test the effect of the number of appearances of a word, I correlated the number of times a word appears in a job description and application, and calculated the p-value. This analysis was done only on words that appear more than 10 times in at least one job posting.
[5] All of these words are significant when [3] is applied to them.
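For readers who haven't met the test in footnote [3], here is a minimal Python sketch of the statistic behind an unpaired two-sample t-test. The footnote doesn't say whether the equal-variance or the unequal-variance (Welch) form was used; the Welch form is assumed here, and the function name and the numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two unpaired samples."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up application counts for postings with and without some word.
with_word = [1, 2, 3, 4, 5]
without_word = [2, 4, 6, 8, 10]
t, df = welch_t(with_word, without_word)
print(f"t = {t:.3f}, df = {df:.2f}")
```

A large |t| relative to a t distribution with df degrees of freedom is what gets reported as a small p-value; a statistics package would do that last lookup.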

Thursday, July 22, 2010

An Ideal Society

Designing the ideal society is an old puzzle. Many ideas were generated over time, and some have even been put into practice. We have tried many different ways to organize society: everything from monarchy to democracy to communism. Yet none of these systems have yet stood the test of time. Ideas that look great on paper often fail in practice.

I think that this is because when we design an ideal society, we allow ourselves to also design the citizens of that society. We allow our society to dictate how a human being should behave, and assume that they will behave as expected. For example, a communist society assumes that its citizens would give their best in return for others' best, and have all citizens' needs met together; that its citizens are willing to stand by the mantra: "from each according to his ability, to each according to his need".

But it's difficult, if not impossible, to convince every person to behave in a certain way. Nobody is perfect, and certainly there will be people whose interests conflict with those of society. Indeed some people will do anything they can to game whatever system is in place.

So perhaps an ideal society is not what we need, because we are not ideal people. Perhaps we have been considering the wrong question all along. Instead of designing the ideal society, perhaps we should be designing a robust society. By “robust” I actually mean two things: First, that the society should still function if certain assumptions about the nature of its citizens are violated. Second, that the "locally optimal" behaviour for an individual should also be optimal for the society. (This is akin to the idea of evolutionary stability in "The Selfish Gene".)

As an example, we can see that communism fails at robustness: the "locally optimal" behaviour for a person would be to produce less and consume more, which is not optimal for society; and if a few people decide not to give their best, this game would become quite unfair to those who play by the rules, and so others are likely to also cheat.

Declaring that we have designed an ideal society when we take the liberty to design its citizens seems like a rather strange exercise. If we can decide how people would think and act, wouldn't any reasonable society we create be an ideal society? Design a society where citizens are required to give up their own children and raise a random person's, but design the citizens so that they understand why this is done (equal opportunity, perhaps?), and you have an “ideal” society.

... and yet we’ve only gone in circles. Tautologies are tautological.

End of Entry

Sunday, June 20, 2010

Mining Jobmine: Part 2. Demand and Supply


If you’re an Engineering or a Math student, you’re in luck. Despite not being to scale, the Venn diagram below shows that over 85% of jobs in Jobmine's first round this term are targeted towards Math or Engineering students.

In fact, a third of the jobs on Jobmine target exclusively Math and Engineering students. Given that programming is a skill that many Math and Engineering students tend to have (or are forced to have), it’s tempting to suggest that these are programming jobs.

If you look at the list of most common words in job titles targeted towards Math and Engineering students, the words “Software”, “Developer” and “Engineering” top the list. To be fair though, if you look at the most common words in all job titles, you see the same three words in different order: “Engineering”, “Software” and “Developer”. Here are some of the other common words in job titles.

It's quite interesting that overall, employers like to refer to us most as a "student" -- then "coop" and "intern". Not so for employers targeting Math and Engineering students. They aren't as fond of referring to co-ops as "student" or "assistant".

My bias towards programmers should already be all too apparent (as I am often referred to as one myself), so it shouldn't be surprising that I ask this next question: What programming skills are in demand? A partial answer can be found by looking at the number of times each of the following programming-related words appears in Jobmine job descriptions.

Okay, so the list of programming languages (and non-programming languages) I chose is quite arbitrary, but seriously? People are still looking for COBOL programmers?


Demand for co-op students is only half of the story. What about supply? To gauge the supply of co-op students, we can look at the number of applications that job postings targeting different faculties receive, shown below.

If you've never seen a box-whisker plot before: the thick line in the middle shows the median value, the box in the middle shows the middle 50% of the values, and the dotted line shows the range of values for number of applications per posting, excluding outliers. Note that outliers were omitted in order to keep the figure clean. Also, if a job posting targets both Arts and Math students for example, that job is taken into account in the plots of both categories.

So what do we see here? Job postings targeting Arts students get the highest number of applications, and postings targeting AHS (Applied Health Sciences) and ENV (Environment) students get the lowest number of applications.


You should be in Math or Engineering, Applied Health Sciences or Environment. You should pay attention in your programming courses. Learn programming. Knowing Java will help you too if you're desperate for jobs.

UPDATE: Fixed the programming language chart to fix over-counting of "R" (thanks Paul for noticing).

End of Entry

Friday, June 11, 2010

Guide to Happiness

Update: I replaced my terrible diagram with a link to the REAL one -- it's much prettier, and it's, um, real.

End of Entry

Wednesday, May 19, 2010

Mining Jobmine: Part 1. Map of the Jobs

First, the map (bigger version here). This map shows where the jobs posted during last weekend's job postings are located. A slightly more interactive version is available here, but takes forever to load.

Yes, there is a job WAY up in the Arctic, and 28 people applied to it. I could have sworn too that there used to be jobs in New Zealand, Hawaii and Australia, but oh well.

A similar map counting the number of applications going to each city is available here (this also takes forever to load). However the two maps look pretty much the same, since the resolution of the bubbles is not that great.

What's also not clear in the maps is the actual number of jobs in the Waterloo and Toronto areas. To make it clear, here are the top 10 locations with the most job openings.

If we take the top 10 locations with the most applications, we get something similar.

Now what happens when we look at the cities that get the most applications per opening? We get something COMPLETELY different.

The more "exotic" places like Saint-Hubert (where the Canadian Space Agency is located), California and others don't offer many jobs, but the ones that are offered attract a lot of applications. Note though that I would take the exact ordering in the last chart with a grain of salt for two reasons: (1) the position that actually attracted the most applications is offered in "Various Locations" -- take a WILD guess what company offered that position (hint: starts with a "G") and (2) a lot of companies lie about the number of openings they have.

What about the places with the lowest applications per job?

I'll leave you to come up with your own conclusions here.

Finally, a note about data and methodology. Jobmine is the system co-op students/employers at Waterloo use to manage job postings and applications. Job name, location, opening and applications data were pulled off of Jobmine at 8am this morning (postings closed at 12am last night). Some location names were changed slightly to avoid multiple entries per name, and so that Google's Geomap tool would map them correctly (it mapped "London" to "London, England", etc.). When a job opening was in multiple locations, I took the first one. Jobs that put "Multiple Locations" or "Various" or something ambiguous as their location were deleted.

I plan to squeeze more goodies out of this data, so stay tuned. Incidentally, if you are familiar with Google Charts API and know how to make it go faster, please let me know.

End of Entry

Wednesday, May 5, 2010

Monty Hall Problem: an intuitive explanation

The Monty Hall Problem is a probability puzzle based on a TV show. Here's the puzzle:
Suppose that you are on a game show, and the host shows you three doors. He tells you that behind two of the doors are goats, and behind one of the doors is a brand new luxury car. He asks you to pick one of the three doors, and if you pick the door with the car behind it, you keep the car. You pick a door (say door #1). The host, knowing which door has the car behind it, walks over to a different door (say door #2) and opens it to reveal a goat. He then offers you a chance to change your mind (and switch to door #3). Should you make the switch?
If you haven't heard the puzzle a billion times already, think for a bit before reading on.

Here's the answer: you should switch. I'll give two explanations as to why. The first one will (hopefully) appeal to your intuition, and the second one will be an argument using probability.

The Intuitive Explanation

Let's change the game for a bit. Suppose instead of only 3 doors, we have 100 doors, with 99 goats and still only one car behind them. After you pick a door (say door A), the host opens 98 doors to reveal 98 goats, leaving only one other door (say door B) closed. In this case, would you choose to switch (to door B)? Again, think about this first before reading on.

The Probabilistic Explanation

Here's how you might have reasoned about the previous scenario: the only case where switching to door B is not beneficial is when you choose the right door the first time. That only has a 1% chance of happening.

The same reasoning applies to the 3 doors scenario. The only case where switching would not help you is when you choose the door with the car behind it the first time. There's a 1 in 3 (about 33%) chance of that happening, and a 2 in 3 (about 67%) chance of picking the wrong door. Thus you will double your probability of winning the car if you decide to switch.
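If the argument still feels slippery, a quick simulation can back it up. The Python sketch below is only an illustration (the function names, trial count and seed are arbitrary): it plays the game many times with and without switching, for both the 3-door and the 100-door versions, with a host who always leaves exactly one other door closed and never reveals the car.

```python
import random

def play(switch, doors, rng):
    """Play one game; return True if the player ends up with the car."""
    car = rng.randrange(doors)
    pick = rng.randrange(doors)
    if pick == car:
        # Host leaves a random goat door closed.
        other = rng.choice([d for d in range(doors) if d != pick])
    else:
        # Host must leave the car's door closed.
        other = car
    final = other if switch else pick
    return final == car

def win_rate(switch, doors=3, trials=100_000, seed=0):
    """Estimate the probability of winning with a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, doors, rng) for _ in range(trials)) / trials

for doors in (3, 100):
    print(f"{doors} doors: stay {win_rate(False, doors):.3f}, "
          f"switch {win_rate(True, doors):.3f}")
```

With 3 doors the two rates should land near 1/3 and 2/3; with 100 doors, near 1/100 and 99/100.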

End of Entry

Wednesday, March 17, 2010

logicomix and atlas shrugged

Though Logicomix is in no way historically accurate, it portrays well a feeling that I think a lot of us share. I'm sure that a lot of people saw the delicate beauty in Euclid's Elements, in the (relative?) certainty of mathematics, and hoped that something remotely similar to Elements could be made to solve dilemmas that come up in everyday life -- a consistent philosophical system deduced from the basic facts that everyone would understand and agree upon.

I had that naive thought once. It's comforting to believe that there's a way to make a limited set of assumptions about life and existence, and derive from those a consistent set of beliefs about everything from metaphysics to ethics to politics.

From what other people are saying around the interweb, I'm not sure if Bertrand Russell would be the right person to attribute this set of feelings to. Actually, this might sound weird, but a more appropriate person would be Ayn Rand.

I recall in Atlas Shrugged, she speaks of people -- even philosophers (and logicians?) -- using logic to prove that logic is flawed/inadequate. These people were, of course, the "bad guys", the people that are held by the masses to be the "leaders" of their chosen fields, but who are really there to foil the heroes, the proponents of logic.

I loathe hearing that so-and-so "used logic to prove that logic is inadequate". I heard it once on Numb3rs, too, so it's quite annoying. I had no idea that this in fact had been done. It's Gödel's Incompleteness Theorem. Of course!

Did Ayn Rand know about Gödel? Would that have changed her mind in any way? If we make the assumptions that (1) human life/experience/society is much more complicated than the natural numbers, and (2) a system of philosophy (built up from a few axioms, for our purpose) can be thought of as a model of the world, then Gödel seems to imply that any such philosophical system is either incomplete or inconsistent.

I'm probably missing something very important here.

End of Entry

Monday, February 15, 2010

Moment of Fame

I celebrated Chinese New Year this year by writing this article for the Facebook Data Team:

End of Entry

Sunday, February 7, 2010

The Impostor Syndrome

The Impostor Syndrome is a phenomenon where a person is incapable of internalizing success, often attributing it to luck, good timing, or pure fluke. The person is fearful that they might be "found out" to be less intelligent or competent than people make them out to be.

It's interesting that this phenomenon is prevalent enough to be both studied and given a name. Very interesting.

(And yes, I've felt this way many times, but it also seems that I am capable of internalizing success. Or maybe I'm just giving myself credit for effort, in which case I should remind you of this quote.)

End of Entry