natural language processing blog: Google 5gram corpus has unreasonable 5grams

20 February 2010

In the context of something completely unrelated, I was looking for a fairly general pattern in the Google 1TB corpus. In particular, I was looking for verbs that are sort of transitive. I did a quick grep for 5grams of the form "the SOMETHING BLAHed the SOMETHING." Or, more specifically:

    grep -i '^the [a-z][a-z]* [a-z][a-z]*ed the [a-z]*'

I then lowercased the matches and merged the counts of 5grams that differed only in capitalization. Here are the top 25, sorted and with counts:

    1  101500  the surveyor observed the use
    2   30619  the rivals shattered the farm
    3   27999  the link entitled the names
    4   22928  the trolls ambushed the dwarfs
    5   22843  the dwarfs ambushed the trolls
    6   21427  the poet wicked the woman
    7   15644  the software helped the learning
    8   13481  the commission released the section
    9   12273  the mayor declared the motion
   10   11046  the player finished the year
   11   10809  the chicken crossed the road
   12    8968  the court denied the motion
   13    8198  the president declared the bill
   14    7890  the board approved the following
   15    7848  the bill passed the house
   16    7373  the fat feed the muscle
   17    7362  the report presented the findings
   18    7115  the committee considered the report
   19    6956  the respondent registered the domain
   20    6923  the chairman declared the motion
   21    6767  the court rejected the argument
   22    6307  the court instructed the jury
   23    5962  the complaint satisfied the formal
   24    5688  the lord blessed the sabbath
   25    5486  the bill passed the senate
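
The filter, lowercase and merge steps described above can be sketched in Python. This is a minimal sketch and not the script actually used for the post: it assumes each corpus line has the form `five words<TAB>count`, and the function name `merge_counts` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Same shape as the grep pattern above, matched case-insensitively.
PATTERN = re.compile(r"^the [a-z]+ [a-z]+ed the [a-z]*", re.IGNORECASE)

def merge_counts(lines):
    """Lowercase matching 5-grams and sum the counts of case variants."""
    merged = Counter()
    for line in lines:
        # Assumed line format: "w1 w2 w3 w4 w5<TAB>count"
        ngram, _, count = line.rstrip("\n").rpartition("\t")
        if PATTERN.match(ngram):
            merged[ngram.lower()] += int(count)
    return merged

# Hypothetical input lines, for illustration only (not real corpus counts).
sample = [
    "The Bill passed the House\t10",
    "the bill passed the house\t5",
    "the cat sat on the\t7",
]
merged = merge_counts(sample)
for ngram, count in merged.most_common():
    print(count, ngram)
```

The two capitalizations of the first 5gram collapse into a single entry, and the third line is dropped because its third word does not end in "ed".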

What the heck?! The first one is shocking enough, but maybe you could convince me. How about numbers 4 and 5? "The trolls ambushed the dwarfs" (and vice versa)? These are the fourth and fifth most common 5grams matching my pattern on the web? "The poet wicked the woman"? What does "wicked" even mean here? And yet these all beat out "The bill passed the house" and "The court instructed the jury". Further down the list there is "The prince compiled the Mishna"??? (#30 is also funny: "the matrix reloaded the matrix" is an amusing segmentation issue.)

If we do a vanilla Google search for the counts of some of these, we get:

    1     10900  the surveyor observed the use
    4      7750  the trolls ambushed the dwarfs
    5      7190  the dwarfs ambushed the trolls
    6     ZERO!  the poet wicked the woman
   15  20200000  the bill passed the house
   22   3600000  the court instructed the jury

This just flabbergasts me. I'm told that lots of people have expressed worries over the Google 1TB corpus, but I had never actually heard any of them myself, and had never seen a problem myself either.

Does anyone have an explanation for these effects? How can I expect to get anything done with such ridiculous data?

41 comments:

Rachel Cotterill 20 February, 2010 14:46
That really doesn't sound like representative data! Have you contacted the folks at Google to find out what might be causing such odd effects?

Chris 20 February, 2010 17:06
I've heard of similar discoveries of problems in this data set (pretty informal: conversations at GALE meetings and NIST workshops).

Another piece of evidence that the counts may be significantly screwed up is that no one is using them in MT; at least, no one "major" finds them indispensable. There's usually a bit of hedging (e.g. things like "we had to do some stupid pruning, so the fact we didn't see significant gains can be attributed to that"). I'm quite sure that if these were really reasonable 5-gram counts, someone in the MT rat race would have found a way to use them.

gromgull 21 February, 2010 02:20
This is very odd. I tried to confirm it, though, and this 5-gram is NOT in my copy of the corpus.

The line would have to be in the 5gm-0106.gz file on the 6th DVD (the 106 file starts with "the legendary" and runs to "the specific").

    zgrep -i wicked 5gm-0106.gz | grep -i poet

gives me only:

    the poet of wickedness also 91

Using your pattern, and grepping for poet, I find only:

    the poet used the muse 108

Did I miss anything? Could it be in a different file? Did you aggregate the files from each 5gm-XXX file and jumble them or something?

gromgull 21 February, 2010 02:36
BUT - the "the bill passed the house" line only occurs 2429 times in my files (summed over 3 different capitalisations),

which is clearly less than the millions of times it appears on the web, so something IS a bit fishy.

gromgull 21 February, 2010 02:39
Drat - I am of course wrong: "the Poet Wicked The Woman 21427" occurs in a different file because capital letters are ordered first. Never mind, that'll teach me to grep before breakfast.

Last comment now :)

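The ordering effect described in this comment is easy to reproduce. A small sketch, assuming the corpus files are sorted by byte value (which the comment implies but the post does not state); the three 5-grams are the ones quoted in this thread:

```python
# In ASCII, every uppercase letter (codes 65-90) sorts before
# every lowercase letter (codes 97-122).
ngrams = [
    "the poet used the muse",
    "the Poet Wicked The Woman",
    "the poet of wickedness also",
]

# Python compares strings code point by code point, which matches
# byte order for plain ASCII text.
ordered = sorted(ngrams)
for ngram in ordered:
    print(ngram)
```

The capitalized variant sorts ahead of both lowercase ones, so in a large sorted corpus it can end up in an earlier file than its lowercase neighbours.
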
Robin 21 February, 2010 05:43
I think the algorithm Google uses to collect the corpus is buggy. Only a human-verified corpus could be perfect, but an extremely large corpus is too difficult to verify. The only solution would be to improve the algorithm further.

Lletraferit 22 February, 2010 00:09
Hi,

Very revealing post, thanks! (Who knows how many weeks I have wasted on useless debugging of algorithms because I assumed the corpus was right! Believing in the data is a scientist's classic mistake ;-)

I think there was too much expectation regarding Google's corpus. Until now, I have had no compelling reason to consider their research particularly impressive, and I have no proof that they care about, or are even aware of, how to build a corpus, either. Actually, Google is known to be generally dismissive of NLP and to rely on ever-increasing amounts of data to build their applications. That's an excellent foundation, but it clearly has some unaddressed problems.

As for the comment about using this corpus for MT, I am not sure that would be a promising approach. The most obvious use of corpora in MT involves parallel corpora, and I wouldn't expect a corpus that fails to meet minimum quality standards to have been parallelized.

I agree with the commenter who says Google's scraper may be buggy. If you are not familiar with it, I strongly recommend looking up the WaCky corpus, built by NLP professionals with a methodology (for a change ;-) and much more reliably:

http://wacky.sslmit.unibo.it/doku.php?id=start

Brendan O'Connor 22 February, 2010 08:53
It's worth keeping in mind that there is no such thing as a uniform or even representative sample of the web. The sample space of "the web" isn't defined, since there are an infinite number of autogenerated pages, pages you can't always get to, and the like. For example, maybe the scraper found a game website and kept clicking the "attack the troll" button and generated 20,000 pages of a dwarf attacking a troll. Or whatever.

I feel like people hope that "the web" is a big textual corpus that's fairly representative of typical ways to use language, but no one has ever done a real investigation to find out under what circumstances this is true.

From what I've heard, web page deduplication is a really hard problem but also a really critical one for this. Maybe whoever did the scrape used a crappy deduplication scheme -- that might explain the dwarf/troll thing, for example.

Brendan O'Connor 22 February, 2010 08:53
Oops. Trolls attack dwarfs. Being sloppy here :)

Kenahoo 22 February, 2010 08:54
Lletraferit: I think people generally believe(d) this corpus *is* the "ever-increasing amounts of data" you mention. Or at least well-sampled from it.

Tom 22 February, 2010 11:31
I agree with Brendan. The current Google web results go through different postprocessing and filtering than the 5-gram corpus did, and that could lead to the discrepancies you see.

hal 22 February, 2010 11:35
Brendan, Tom: Yes, I think it's clear that the ngram counts != Google counts. That's fine with me. Surely everyone (I hope) knows that there are dedup issues for anything on the web. But even taking that into account, I can't reconcile the differences in my head, especially the 20k that turned into a zero!

I guess what I'm saying is that I would expect some sort of vague correlation between the counts. Obviously I don't expect them to be the same, or even on the same order of magnitude.

But these counts just look random.

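One way to check for that "vague correlation" is a rank correlation over the six 5grams for which the post gives both numbers. A sketch using Spearman's formula for untied ranks; the function names are illustrative, and six hand-picked data points are far too few to conclude much:

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Counts copied from the two tables in the post (rows 1, 4, 5, 6, 15, 22).
ngram_counts = [101500, 22928, 22843, 21427, 7848, 6307]
web_counts = [10900, 7750, 7190, 0, 20200000, 3600000]

rho = spearman(ngram_counts, web_counts)
print(round(rho, 3))  # -0.371
```

On these six rows the rank correlation comes out negative, which fits the impression that the counts look random, though the rows were chosen precisely because they looked odd.
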
Tom 22 February, 2010 11:53
20k could turn into zero if they all came from the same set of pages that got filtered out later (e.g. a spam site with fake content designed to manipulate PageRank).

Aside from those cases, though, I think that overall the counts will still correlate with the "actual Google counts". If you found the actual Google counts for the 5-gram results ranked #1-100, #1001-1100, #5001-5100, etc., I think the higher-ranked sets would have higher average actual Google counts than the lower-ranked sets.

Unknown 22 February, 2010 12:18
Those 20K+ occurrences of trolls/dwarfs in the n-grams, like the 7K hits on the Web, are basically all from the same sentence (repeated over and over on the Web).

I wouldn't say the Google n-gram counts are bad -- the repetition of that sentence, like the other strange ones, is a legit phenomenon on the Web. However, I completely agree that for some NLP applications, the counts might not be the best ones.

It's often more informative to count each unique sentence on the Web only once. If you do that, even over a billion Web pages, you count "the trolls ambushed the dwarfs" only a handful of times (e.g., I got 5 occurrences of that when searching for "ambushed" as a predicate in TextRunner, which btw might be a better source of counts for your needs here).

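Counting each unique sentence only once can be sketched as follows. This is an illustrative toy, not how TextRunner or the Google corpus was built: the sentences, the copy count of 1000 and the function name `ngram_counts` are made up, and real deduplication would also have to handle near-duplicates.

```python
from collections import Counter

def ngram_counts(sentences, n=5, unique_sentences=False):
    """Count word n-grams, optionally counting each distinct sentence once."""
    lowered = [s.lower() for s in sentences]
    if unique_sentences:
        # dict.fromkeys drops exact duplicates while keeping first-seen order
        lowered = list(dict.fromkeys(lowered))
    counts = Counter()
    for sentence in lowered:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical crawl: one blurb copied onto 1000 pages, plus one other sentence.
crawl = ["Then the trolls ambushed the dwarfs"] * 1000
crawl.append("The court instructed the jury")

raw = ngram_counts(crawl)
deduped = ngram_counts(crawl, unique_sentences=True)
key = "the trolls ambushed the dwarfs"
print(raw[key], deduped[key])
```

With per-sentence deduplication the copied blurb contributes a single occurrence, so it no longer outranks a sentence that appears once.
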
Unknown 22 February, 2010 13:26
You're just biased against people who like to read "The Hobbit." Lots. :-)

zmccord 22 February, 2010 14:41
I blame popular author Terry Pratchett for (4) and (5): http://wiki.lspace.org/wiki/Battle_of_Koom_Valley

Marine 23 February, 2010 16:11
This reminds me of another Google page count weirdness story from a few years ago (but the issues were apparently fixed before the 5gram corpus was collected).

The 5gram corpus has been used successfully for disambiguation of prepositions and non-referential pronouns: closed-class word patterns are probably less sensitive to unexplained occurrences of trolls and poets...

DrNI@AM 23 February, 2010 16:18
The thing with the dwarfs makes some sense at least. It appears to me that this is due to a very common issue in creating corpora from the web: duplicates and near-duplicates. The web is full of copy-paste text. These duplicate snippets are not always easy to spot; there are several algorithms, but they tend to fail if the copied text snippet is short. Googling reveals that the dwarf thing actually comes from a review of a book by Terry Pratchett. Quite naturally, every online book store will have the standard review written by the publisher in the article description.

Google hit counts can change very quickly. As pointed out in other comments, Google may just block a site from being indexed, or the Terry Pratchett book might run out of stock worldwide and disappear from the shops.

Koen Deschacht 24 February, 2010 05:32
Maybe there is an easier and less suspicious explanation: every top-ranked 5-gram will almost by definition be an outlier. The chance of observing a random sequence of 5 words is very, very small. Of course language is not random, but the chance of observing any particular 5-word sequence that is actually used in language is still very small. Thus you could predict in advance that the top-ranked 5-grams will not be "standard" language. This, however, holds for every corpus.

Anonymous 28 February, 2010 06:12
Sorry to bring in a bit of ethnic knowledge, but "the Prince compiled the Mishna" probably refers to a well-known event in the history of Judaism, when Rabbi Judah HaNasi (Judah the Prince) codified basic elements of Jewish law.

Bob Moore 28 February, 2010 11:56
I think there is link spamming going on here. When I searched for "the surveyor observed the use", Google returned the following as the fourth-ranked search result:

    e-Commerce Writers and Academician | XING
    1 101500 the surveyor observed the use 2 30619 the rivals shattered the farm 3 27999 the link entitled the names 4 22928 the trolls ambushed the dwarfs ...
    www.xing.com/net/ecomwriteracademic

Notice that the first four of Hal's weird 5-grams show up just in this small snippet. When I checked the cached copy of the page that was pointed to, Google told me that the search term occurred only in the referring pages. Probably someone has created a link farm using these 5-grams to boost the rank of the pages they point to when these terms are used as search queries.

Bob Moore 28 February, 2010 12:02
Oops! The snippet I displayed above not only contains the phrases, but also the counts! So it must have been created from Hal's original post.

Anonymous 07 March, 2010 22:00
IIRC, this data truncated all N-grams with fewer than 50 hits.

Can that explain the weirdness you've seen?

Anonymous 14 April, 2010 16:16
"The poet wicked the woman" is a segmentation issue too, plus a misunderstanding.

This is a list of plays!
Recently on Broadway were "A Touch of the Poet," "Wicked," and "The Woman in White."

"Wicked" is being read as a past-tense verb because "to wick" is a verb, and technically "wicked" can also mean "sucked up moisture."

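The mechanism can be illustrated with a sliding 5-gram window over a list of titles whose boundaries have been lost. A sketch only: the listing text and the tokenization below are a guess at what might have happened, not the documented corpus pipeline.

```python
import re

def five_grams(text):
    """Return all 5-grams over the word tokens of text, ignoring punctuation and line breaks."""
    words = re.findall(r"[A-Za-z]+", text)
    return [" ".join(words[i:i + 5]) for i in range(len(words) - 4)]

# Hypothetical listing page: one play title per line, no sentence boundaries.
listing = "A Touch of the Poet\nWicked\nThe Woman in White"

grams = five_grams(listing)
for gram in grams:
    print(gram)
```

Because nothing marks where one title ends and the next begins, one of the windows is "the Poet Wicked The Woman", which reads like a subject, a verb and an object.
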
Unknown 29 April, 2010 13:38
Not that I am defending the data, but here are some variables to consider...

Could it be spam?
Spam can seem random, but the same thing may be posted by bots everywhere; four years later this post is still getting spammed.

The web changes fast.
The data is from 2006, and the Google algorithm has vastly improved since then. They have only recently added contextualization. Maybe now they are able to weed out oddities and segment a list of plays.

Whether the data is correct and/or valid does not change our need to verify its usefulness, which is why I greatly appreciated your post.

Anonymous 16 June, 2010 01:00
Really? A natural language processing blog and you can't see that you may have a language snippet cut out of context? "the surveyor observed the use" -> The auditing team assigned to 'the surveyor observed the use' of high-quality tools and noted it in their reports.

Anonymous 16 June, 2010 08:21
To the anonymous commenter led here from Hacker News: the post is talking about relative frequency, not whether that snippet exists or not (he showed right in the post that it does, just at a much lower frequency than others that are actually more representative).
