Showing posts with label data. Show all posts
Showing posts with label data. Show all posts

20 February 2010

Google 5gram corpus has unreasonable 5grams

In the context of something completely unrelated, I was looking for a fairly general pattern in the Google 1TB corpus. In particular, I was looking for verbs that are sort of transitive. I did a quick grep for 5grams of the form "the SOMETHING BLAHed the SOMETHING." Or, more specifically:

    grep -i '^the [a-z][a-z]* [a-z][a-z]*ed the [a-z]*'
I then took these, lower cased them, and then merged the counts. Here are the top 25, sorted and with counts:
     1  101500  the surveyor observed the use
2 30619 the rivals shattered the farm
3 27999 the link entitled the names
4 22928 the trolls ambushed the dwarfs
5 22843 the dwarfs ambushed the trolls
6 21427 the poet wicked the woman
7 15644 the software helped the learning
8 13481 the commission released the section
9 12273 the mayor declared the motion
10 11046 the player finished the year
11 10809 the chicken crossed the road
12 8968 the court denied the motion
13 8198 the president declared the bill
14 7890 the board approved the following
15 7848 the bill passed the house
16 7373 the fat feed the muscle
17 7362 the report presented the findings
18 7115 the committee considered the report
19 6956 the respondent registered the domain
20 6923 the chairman declared the motion
21 6767 the court rejected the argument
22 6307 the court instructed the jury
23 5962 the complaint satisfied the formal
24 5688 the lord blessed the sabbath
25 5486 the bill passed the senate
What the heck?! First of all, the first one is shocking, but maybe you could convince me. How about numbers 4 and 5? "The trolls ambushed the dwarfs" (and vice versa)? These things are the fourth and fifth most common five grams matching my pattern on the web? "The poet wicked the woman"? What does "wicked" even mean? And yet these all beat out "The bill passed the house" and "The court instructed the jury". But then #23: "The prince compiled the Mishna"??? (#30 is also funny: "the matrix reloaded the matrix" is an amusing segmentation issue.)

If we do a vanilla google search for the counts of some of these, we get:
     1     10900  the surveyor observed the use
4 7750 the trolls ambushed the dwarfs
5 7190 the dwarfs ambushed the trolls
6 ZERO! the poet wicked the woman
15 20200000 the bill passed the house
22 3600000 the court instructed the jury
This just flabbergasts me. I'm told that lots of people have expressed worries over the Google 1TB corpus, but have never actually heard anything myself... And never seen anything myself.

Does anyone have an explanation for these effects? How can I expect to get anything done with such ridiculous data!

23 June 2008

Help! Contribute the LaTeX for your ACL papers!

This is a request to the community. If you have published a paper in an ACL-related venue in the past ten years or so, please consider contributing the LaTeX source. Please also consider contributing talk slides! It's an relatively painless process: just point your browser here and upload! (Note that we're specifically requesting that associated style files be included, though figures are not necessary.)