natural language processing blog: Fast & easy baseline text categorization with vw

05 August 2016

Fast & easy baseline text categorization with vw

About a month ago, the paper Bag of Tricks for Efficient Text Categorization was posted to arxiv. I found it thanks to Yoav Goldberg's rather incisive tweet:

Yoav is basically referring to the fact that the paper is all about (a) hashing features, (b) bigrams, and (c) a projection that doesn't totally make sense to me. In vw, (a) is the default, (b) requires "--ngram 2", and (c) is something I don't totally understand and don't think is necessary. (See this tutorial for more on how to do NLP in VW.)
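For readers who haven't seen the hashing trick, here is a minimal sketch of what "hashed unigram and bigram features" means. It is illustrative only: the helper names `ngrams` and `bucket` are made up for this example, and it uses the POSIX `cksum` CRC as a stand-in, which is not vw's actual hash function.

```sh
#!/bin/sh
# Illustrative only: cksum is a stand-in hash, NOT the hash vw uses.

# Print every unigram and adjacent-word bigram of a whitespace-tokenized line.
ngrams() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i <= NF; i++) {
            print $i
            if (i < NF) print $i "_" $(i + 1)
        }
    }'
}

# Map a feature string to a bucket in [0, 2^bits) via its CRC checksum.
bucket() {
    printf '%s' "$1" | cksum | awk -v bits="$2" '{ print $1 % (2 ^ bits) }'
}

# Each feature string becomes an index into a fixed-size weight vector.
ngrams "fast text baseline" | while read -r feat; do
    printf '%s -> %s\n' "$feat" "$(bucket "$feat" 25)"
done
```

The point is that no dictionary is ever built: any string, unigram or bigram, lands directly in one of a fixed number of bins, and collisions are simply tolerated.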

At the time, I said if they gave me the data, I'd run vw on it and report results. They were nice enough to share the data but I never got around to running it. The code for their technique ("fastText") was just released, which goaded me into finally doing something.

So my goal here was to try to tell, without tuning any parameters and with minimal effort, how competitive vw is as a baseline against the fastText results.

Here are the results:

                       fastText       vw
Dataset          ng    time   acc     time   acc
ag news          1     -      91.5    2s     91.9
ag news          2     3s     92.5    5s     92.3
amazon full      1     -      55.8    47s    53.6
amazon full      2     33s    60.2    69s    56.6
amazon polarity  1     -      91.2    46s    91.3
amazon polarity  2     52s    94.6    68s    94.2
dbpedia          1     -      98.1    8s     98.4
dbpedia          2     8s     98.6    17s    98.7
sogou news       1     -      93.9    25s    93.6
sogou news       2     36s    96.8    30s    96.9
yahoo answers    1     -      72.0    30s    70.6
yahoo answers    2     27s    72.3    48s    71.0
yelp full        1     -      60.4    16s    56.9
yelp full        2     18s    63.9    37s    60.0
yelp polarity    1     -      93.8    10s    93.6
yelp polarity    2     15s    95.7    20s    95.5

(ng = n-gram order; time = training time; acc = accuracy in %; "-" = no time listed.)

(Average accuracy for fastText is 83.2; for vw is 82.2.)
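As a sanity check, the two averages can be recomputed from the accuracy columns of the table above. This is just a small sketch; the `results` variable and `avg` helper are names made up for this example, and the numbers are copied by hand from the table.

```sh
#!/bin/sh
# Columns: dataset_ngram  fastText_acc  vw_acc  (copied from the table above)
results='ag_news_1 91.5 91.9
ag_news_2 92.5 92.3
amazon_full_1 55.8 53.6
amazon_full_2 60.2 56.6
amazon_polarity_1 91.2 91.3
amazon_polarity_2 94.6 94.2
dbpedia_1 98.1 98.4
dbpedia_2 98.6 98.7
sogou_news_1 93.9 93.6
sogou_news_2 96.8 96.9
yahoo_answers_1 72.0 70.6
yahoo_answers_2 72.3 71.0
yelp_full_1 60.4 56.9
yelp_full_2 63.9 60.0
yelp_polarity_1 93.8 93.6
yelp_polarity_2 95.7 95.5'

# Mean of the given column over all 16 rows, rounded to one decimal place.
avg() {
    printf '%s\n' "$results" | awk -v col="$1" \
        '{ s += $col; n++ } END { printf "%.1f\n", s / n }'
}

echo "fastText: $(avg 2)  vw: $(avg 3)"   # fastText: 83.2  vw: 82.2
```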

In terms of accuracy, the two are roughly on par. vw occasionally wins; when it does, it's usually by 0.1% to 0.5%. fastText wins a bit more often, and on two datasets it wins significantly (yelp full, by 3%-4%; amazon full, by 2%-4%) and on one a bit less (yahoo answers, up by about 1.3%). But the numbers are pretty much in line, and could almost certainly be brought up for vw with a wee bit of hyperparameter tuning (namely the learning rate, which is tuned in fastText).

In terms of training time, fastText is maybe 30% faster on average, though these are such small datasets (e.g., 500k examples) that a difference of 52s versus 68s is not too significant. I also noticed that for most of the datasets, simply writing the model to disk for vw took a nontrivial amount of time. But wait, there's more. That 30% advantage for fastText came from a run on 20 cores in parallel, whereas the vw run did not use parallelized learning (vw runs two threads, one for I/O and one for learning).

That said, there's a major caveat on comparing the training times: they were run on different machines. I don't know what type of machine the fastText results were achieved on, but it was a parallel 20-core run. The vw experiments were run on a single core, one pass over the data, on a 3.1GHz Core i5-2400. Yes, I could have hogwild-ed vw and gotten it faster, but it really didn't seem worth it for datasets this small. And yes, I could've rerun fastText on my machine, but... what can I say? I'm lazy.

What did I do to get these vw numbers? Here's the entire training script:

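% cat run.sh
#!/bin/bash
# Usage: ./run.sh <dataset-dir>   (expects train.csv, test.csv, classes.txt)
d=$1
for ngram in 1 2 ; do
  # Train: one-against-all over the number of classes, 25-bit hash, n-grams
  cat "$d/train.csv" | ./csv2vw.pl | \
    time vowpal_wabbit/vowpalwabbit/vw --oaa $(wc -l < "$d/classes.txt") \
                                  -b25 --ngram $ngram -f "$d/model.$ngram"
  # Test: -t turns off learning, -i loads the model saved above
  cat "$d/test.csv"  | ./csv2vw.pl | \
    time vowpal_wabbit/vowpalwabbit/vw -t -i "$d/model.$ngram"
done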

Basically the only flags to vw are (1) telling it to do multiclass classification with one-against-all, (2) telling it to use 25 bits (not tuned), and (3) telling it to use either unigrams or bigrams. [Comparison note: this means vw is using 33m hash bins; fastText used 10m for unigram models and 100m for bigram models.]
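The bin counts follow directly from the bit setting, since b bits give 2^b bins. A tiny sketch of the arithmetic (the `bins` helper is just a name for this example):

```sh
#!/bin/sh
# Number of hash bins for a given vw -b value: 2^bits.
bins() {
    echo $((1 << $1))
}

bins 25   # 33554432  (the ~33m used here)
bins 27   # 134217728 (the smallest power of two above fastText's 100m)
```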

The only(*) data munging that occurs is in csv2vw.pl, which is a lightweight script for converting the data, lowercasing, and doing very minor tokenization:

% cat csv2vw.pl
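#!/usr/bin/perl -w
use strict;
# Convert CSV lines of the form "label","text",... into vw format: label | text
while (<>) {
    chomp;
    # $1 = numeric class label, $2 = the rest of the line after the label
    if (/^"*([0-9]+)"*,"(.+)"*$/) {
        print $1 . ' | ';
        $_ = lc($2);                  # lowercase
        s/","/ /g;                    # join the remaining CSV fields with a space
        s/""/"/g;                     # unescape doubled quotes
        # space-pad runs of characters outside a-z, 0-9, and the range
        # ' ' through '\' (so most ASCII punctuation is left attached)
        s/([^a-z0-9 -\\]+)/ $1 /g;
        s/:/C/g;                      # ':' and '|' are special in vw's input
        s/\|/P/g;                     # format, so replace them with letters
        print $_ . "\n";
    } else {
        die "malformed line '$_'";
    }
}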

There are two exceptions where I did slightly more data munging. The datasets released for dbpedia and Sogou were not properly shuffled, which makes online learning hard. I preprocessed the training data by randomly shuffling it. This took 2.4s for dbpedia and 12s for Sogou.
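The post doesn't say which tool did the shuffling, so the following is only one plausible way to do it. It assumes GNU coreutils `shuf` is installed and that every example sits on a single line (which csv2vw.pl also assumes); the `shuffle_file` helper and the file names are made up for this sketch.

```sh
#!/bin/sh
# Shuffle the lines of file $1 into file $2.
# Assumes GNU coreutils `shuf` and exactly one example per line.
shuffle_file() {
    shuf "$1" > "$2"
}

# Tiny self-contained demo on a temporary file.
tmpdir=$(mktemp -d)
printf '%s\n' '"1","a"' '"2","b"' '"3","c"' '"4","d"' > "$tmpdir/train.csv"
shuffle_file "$tmpdir/train.csv" "$tmpdir/train.shuf.csv"
wc -l < "$tmpdir/train.shuf.csv"   # same number of lines as the input
rm -rf "$tmpdir"
```

Shuffling matters here because a single online pass over data sorted by class sees each label in one long run, rather than a mix of labels throughout.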

[[[EDIT 2:20p 5 Aug 2016: Out of curiosity, I upped the number of bits that vw uses for the experiments to 27 (so that it's on par with the 100m used by fastText). This makes it take about 5 seconds longer to run (writing the model to disk is slower). Performance stays the same on ag news, amazon polarity, dbpedia, sogou, and yelp polarity; it goes up from 53.6/56.6 to 55.0/58.8 on amazon full, from 70.6/71.0 to 71.1/71.6 on yahoo answers, and from 56.9/60.0 to 58.5/61.6 on yelp full. This puts the vw average with more bits at 82.6, which is 0.6% behind the fastText average.]]]

Long story short... am I switching from vw to fastText? Probably not any time soon.

6 comments:

Anonymous said...: Where could I find the datasets you trained VW on? I'm interested in doing a little experimentation myself.; 05 August, 2016 10:49
hal said...: Zachary: I think you'd have to contact the fastText authors; I'm not sure that I'm allowed to redistribute.; 05 August, 2016 11:19
Xiang Zhang said...: The FastText paper is not the one that constructed these datasets. They were first used in the following paper
http://arxiv.org/abs/1509.01626
(I am the first author).

The datasets are available in a Google drive link.

They are huge compared to text classification datasets available in the past. It's just that linear models can be made extremely fast to make that matter less.; 05 August, 2016 12:07
Anonymous said...: @Xiang Zhang Can you share the google drive link?

Thanks!; 05 August, 2016 12:17
Xiang Zhang said...: Zachary: oops I intended to put the link in the previous comment but forgot to really paste it...

http://goo.gl/JyCnZq; 05 August, 2016 12:36
hal said...: Thanks, Xiang!; 05 August, 2016 12:44
