Wednesday, September 19, 2007

search engines and such

On Language Log, Barbara Partee comments on a counter-intuitive set of Google search results:

Google peculiarities: When I tried to get a rough Google comparison of "biggest * of any of the other" vs. "biggest * of any of the", I actually seemed to get a much bigger number for the first, though it should be a subset of the second. I got 106,000,000 for the first and just 12,800 for the second! But then with some help from Kai von Fintel and David Beaver, it was discovered that Google behaves very strangely with some ungrammatical strings. Closer inspection of the return from the search that seemed to give 106,000,000 hits shows that it returns only 3 pages of results, with the number 106,000,000 at the top of pages 1 and 2, but the number 21 on page 3, and in fact it only returned 21 hits.

David sleuthed out the phenomenon; here's his report.

***********

Unfortunately, the numbers given as results of Google searches have become less meaningful over the last few years rather than improving in any sense relevant to us. The numbers Google gives in response to a query are not counts of the number of pages with the given string. Rather, they are estimates based on a formula that, so far as I know, is not public. For simple searches, the estimate is presumably based on a calculation of the probability of the page having all the search terms based on the number of pages in the Google caches for each of the component terms. But once you start doing string searches, this sort of approach becomes very unreliable.
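(To make that guess concrete, here is a minimal sketch of an estimate that treats search terms as statistically independent. It only illustrates the kind of calculation being speculated about; the real formula is not public. The function name, the per-term counts, and the index size are all made up for the example.)

```python
from fractions import Fraction

def estimate_hits(term_counts, index_size):
    """Estimate how many pages contain ALL terms, assuming (hypothetically)
    that each term occurs independently of the others."""
    estimate = Fraction(index_size)
    for count in term_counts.values():
        # Multiply by the fraction of indexed pages containing this term.
        estimate *= Fraction(count, index_size)
    return round(estimate)

# Invented per-term page counts and index size, for illustration only.
counts = {"biggest": 2_000_000, "other": 50_000_000}
print(estimate_hits(counts, index_size=1_000_000_000))
```

(Note that nothing in this calculation looks at word order, so an exact-phrase query would get the same estimate whether or not the phrase is one anyone would actually write. That is one way such an approach could go badly wrong for string searches.)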

I assume that the oddity of the result for "biggest * of any of the other" occurs because Google doesn't have any smart way to calculate the likelihood of strings for which the number of responses appears too large to simply count them. That is, I guess the algorithm works by first putting some bounds on the likely number of hits based on e.g. how rapidly various Google network nodes appear to be sending responses, and if that number is sufficiently small, then Google uses some fairly accurate algorithm for estimating the total, like counting every single response. But if there appear to be loads of responses, then the algorithm makes an estimate based on, well, who knows what. In the case at hand (and similarly for "smallest * of any of the other", "largest * of any of the other"), the estimate assumes some distributional properties that just don't hold for semantically or syntactically anomalous strings. Then, as you start going through the hits, Google is forced to self-correct as soon as you force it to actually enumerate all the results.
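(The behaviour guessed at here can be mimicked with a toy model: show a rough estimate on every results page until the enumeration runs out, then show the true count. This is a sketch of the conjecture only, not a description of Google's actual algorithm. The function name, the page size of 10, and the threshold below which results are simply counted are assumptions made for the example.)

```python
import math

def displayed_totals(true_hits, rough_estimate, page_size=10, exact_below=1000):
    """Return the total shown at the top of each results page when a user
    pages through to the end (hypothetical model of the conjecture)."""
    pages = max(1, math.ceil(true_hits / page_size))
    if rough_estimate < exact_below:
        # Few apparent hits: assume every result is simply counted up front.
        return [true_hits] * pages
    # Many apparent hits: the estimate is shown until the last page,
    # where enumeration forces a correction to the true count.
    return [rough_estimate] * (pages - 1) + [true_hits]

# The figures reported above: 106,000,000 estimated, 21 actually returned.
print(displayed_totals(21, 106_000_000))
```

(With those figures the model gives three pages, with the estimate at the top of the first two and 21 on the third, which matches what Barbara saw.)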

Hmm. So, if I'm right, then Barbara has stumbled on a rather interesting test for grammatical anomaly (though only relative to Google's bizarre assumptions about normality). Let's try another case: "* who thinks that is happy". This one has a pretty damn ordinary set of words in it, but suffers from an unfortunate case of a missing subject. Here Google initially estimates 10,900 results. But then it rapidly revises down to 16...
[article with the question that started this off here]

Meanwhile Joel Spolsky has a prophetic post claiming that Gmail will be the WordPerfect of e-mail here.

1 comment:

Anatoly said...

It's unfortunate that Google's estimates of the number of results are often so inconsistent. I've noticed it myself over the years, and it seems like they used to be much better.