05 August 2016

Fast & easy baseline text categorization with vw

About a month ago, the paper Bag of Tricks for Efficient Text Categorization was posted to arxiv. I found it thanks to Yoav Goldberg's rather incisive tweet:

Yoav is basically referring to the fact that the paper is all about (a) hashing features and (b) bigrams and (c) a projection that doesn't totally make sense to me, which (a) vw does by default (b) requires "--ngrams 2" and (c) I don't totally understand I don't think is necessary. (See this tutorial for more on how to do NLP in VW.)

At the time, I said if they gave me the data, I'd run vw on it and report results. They were nice enough to share the data but I never got around to running it. The code for their technique ("fastText") was just released, which goaded me into finally doing something.

So my goal here was to try to tell, without tuning any parameters, how competitive a baseline vw is to the results from fastText with minimal effort.

Here are the results:

ag news1
ag news23s92.55s92.3
amazon full1
amazon full233s60.269s56.6
amazon polarity1
amazon polarity252s94.668s94.2
sogou news1
sogou news236s96.830s96.9
yahoo answers1
yahoo answers227s72.348s71.0
yelp full1
yelp full218s63.937s60.0
yelp polarity1
yelp polarity215s95.720s95.5

(Average accuracy for fastText is 83.2; for vw is 82.2.)

In terms of accuracy, the two are roughly on par. vw occasionally wins; when it does, it's usually by 0.1% to 0.5%. fastText wins a bit more often, and on one dataset it wins significantly (yelp full: winning by 3%-4%) and on one a bit less (yahoo answers, up by about 1.3%). But the numbers are pretty much in line, and could almost certainly be brought up for vw with a wee bit of hyperparameter tuning (namely the learning rate, which is tuned in fastText).

In terms of training time, fastText is maybe 30% faster on average, though these are such small datasets (eg 500k examples) that a difference of 52s versus 68s is not too significant. I also noticed that for most of the datasets, simply writing the model to disk for vw took a nontrivial amount of time. But wait, there's more. That 30% faster for fastText was run on 20 cores in parallel whereas the vw run did not use parallelized learning (vw runs two threads, one for I/O and one for learning).
That said, a major caveat on comparing the training times. They're run on different machines. I don't know what type of machine the fastText results were achieved on, but it was a parallel 20-core run. The vw experiments were run on a single core, one pass over the data, on a 3.1Ghz Core i5-2400. Yes, I could have hogwild-ed vw and gotten it faster but it really didn't seem worth it for datasets this small. And yes, I could've rerun fastText on my machine, but... what can I say? I'm lazy.

What did I do to get these vw numbers? Here's the entire training script:
% cat run.sh 
for ngram in 1 2 ; do
  cat $d/train.csv | ./csv2vw.pl | \
    time vowpal_wabbit/vowpalwabbit/vw --oaa `cat $d/classes.txt | wc -l` \
                                  -b25 --ngram $ngram -f $d/model.$ngram
  cat $d/test.csv  | ./csv2vw.pl | \
    time vowpal_wabbit/vowpalwabbit/vw -t -i $d/model.$ngram
Basically the only flags to vw are (1) telling it to do multiclass classification with one-against-all, (2) telling it to use 25 bits (not tuned), and telling it to either use unigrams or bigrams. [Comparison note: this means vw is using 33m hash bins; fastText used 10m for unigram models and 100m for bigram models.]

The only(*) data munging that occurs is in csv2vw.pl, which is a lightweight script for converting the data, lowercasing, and doing very minor tokenization:
% cat csv2vw.pl
#!/usr/bin/perl -w
use strict;
while (<>) {
    if (/^"*([0-9]+)"*,"(.+)"*$/) {
        print $1 . ' | ';
        $_ = lc($2);
        s/","/ /g;
        s/([^a-z0-9 -\\]+)/ $1 /g;
        print $_ . "\n";
    } else { 
        die "malformed line '$_'";
There are two exceptions where I did slightly more data munging. The datasets released for dbpedia and Soguo were not properly shuffled, which makes online learning hard. I preprocessed the training data by randomly shuffling it. This took 2.4s for dbpedia and 12s for Soguo.

[[[EDIT 2:20p 5 Aug 2016: Out of curiosity, I upped the number of bits that vw uses for the experiments to 27 (so that it's on par with the 100m used by fastText). This makes it take about 5 seconds longer to run (writing the model to disk is slower). Performance stays the same on: ag news, amazon polarity, dbpedia, sogou, and yelp polarity; and it goes up from from 53.6/56.6 to 55.0/58.8 on amazon full, from 70.6/71.0 to 71.1/71.6 on yahoo answers, from 56.9/60.0 to 58.5/61.6 on yelp full. This puts the vw average with more bits at 82.6, which is 0.6% behind the fastText average.]]]

Long story short... am I switching from vw to fastText? Probably not any time soon.


Zachary Mayer said...

Where could I find the datasets you trained VW on? I'm interested in doing a little experimentation myself.

hal said...

Zachary: I think you'd have to contact the fastText authors; I'm not sure that I'm allowed to redistribute.

Xiang Zhang said...

The FastText paper is not the one that constructed these datasets. They were first used in the following paper
(I am the first author).

The datasets are available in a Google drive link.

They are huge compared to text classification datasets available in the past. It's just that linear models can be made extremely fast to make that matter less.

Zachary Mayer said...

@Xiang Zhang Can you share the google drive link?


Xiang Zhang said...

Zachary: oops I intended to put the link in the previous comment but forgot do really paste it...


hal said...

Thanks, Xiang!