Using fastText to classify reddit submissions

A quick and dirty tutorial

Will Radford - @wejradford

fastText

Concerted programme of research from Facebook towards practical text classification. Three key papers from 2016:

  • Bag of Tricks for Efficient Text Classification (Joulin et al.)
  • Enriching Word Vectors with Subword Information (Bojanowski et al.)
  • FastText.zip: Compressing text classification models (Joulin et al.)

Open source on GitHub and elsewhere

How does it work?


  • Lowest layer maps n-grams to d-dimensional word embeddings (à la word2vec), smushed together by averaging.
  • Hidden layer maps each dimension to a weight.
  • Top layer is the label.
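
As a rough mental model (not the real C++ code), the whole classifier is an embedding lookup, an average and a linear layer. A minimal numpy sketch with made-up sizes:

    import numpy as np

    # Illustrative fastText-style forward pass; all sizes here are made up.
    vocab_size, dim, n_labels = 100_000, 50, 21

    embeddings = np.random.randn(vocab_size, dim) * 0.01      # n-gram embedding table (lowest layer)
    output_weights = np.random.randn(n_labels, dim) * 0.01    # linear layer from hidden to labels

    def predict(ngram_ids):
        hidden = embeddings[ngram_ids].mean(axis=0)           # average the n-gram embeddings
        scores = output_weights @ hidden                      # one score per label
        probs = np.exp(scores) / np.exp(scores).sum()         # softmax
        return int(probs.argmax())                            # top layer: the predicted label

    print(predict([3, 17, 42]))                               # e.g. hashed ids for a short title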

Mixing neural and shallow models

  • Word embeddings for a task-specific representation
  • Linear models for fast learning and prediction

$\rightarrow$ Faster and hopefully better than bag-of-words

Dealing with large vocabulary

  • Word n-grams give more context than words alone
    >>> def tokens(text): return text.split()
    >>> def ngrams(text):
    ...     padded = ['START'] + tokens(text) + ['END']
    ...     return ['_'.join(b) for b in zip(padded, padded[1:])]
    >>> tokens('the cat sat on the mat')
    ['the', 'cat', 'sat', 'on', 'the', 'mat']
    >>> ngrams('the cat sat on the mat')
    ['START_the', 'the_cat', 'cat_sat', 'sat_on', 'on_the', 'the_mat', 'mat_END']
    
  • But:

    • Much sparser features (more zeros than non-zeros)
    • Need to maintain the mappings from sat_on to feature_12345, which costs memory and slows multicore learning
  • The hashing trick (see also Vowpal Wabbit) maps n-grams into a constrained memory space (sketched below)

$\rightarrow$ More memory-efficient and parallelisable learning
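
A sketch of the trick (the bucket count and hash function are placeholders; fastText and Vowpal Wabbit use their own fixed hashes):

    N_BUCKETS = 2 ** 20   # fixed feature space, chosen up front

    def hashed_features(ngrams):
        # Each n-gram maps straight to a bucket index, so there is no growing
        # vocabulary dictionary to store or share between cores.
        # Python's built-in hash is salted per process; real implementations
        # use a fixed hash such as FNV.
        return [hash(ng) % N_BUCKETS for ng in ngrams]

    print(hashed_features(['START_the', 'the_cat', 'cat_sat']))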

Classifying in large label spaces

  • Linear classifiers run at $O(fk)$ where:
    • $f$ is the number of features (i.e. dimensions)
    • $k$ is the number of labels
  • Hierarchical softmax reduces this to $O(f\log_{2}k)$ by building a tree of labels arranged by probability, so a prediction only scores one root-to-leaf path (back-of-envelope below).

$\rightarrow$ Faster with some accuracy cost
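
Back-of-envelope, assuming a roughly balanced label tree and the 21 labels from the toy task below:

    import math

    f, k = 50, 21                    # embedding dimensions and number of labels
    flat = f * k                     # O(fk): score every label
    hier = f * math.log2(k)          # O(f log2 k): score one root-to-leaf path
    print(flat, round(hier), round(flat / hier, 1))   # 1050 220 4.8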

Compressing models

  • Large models limit the deployment environment and increase load time
  • Vector quantisation lossily compresses the model

$\rightarrow$ Smaller models at some accuracy cost
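
fastText actually uses product quantisation; to show where the lossy saving comes from, here is a much cruder single-codebook sketch over made-up weights (scikit-learn's k-means stands in for the real code path):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    weights = rng.standard_normal((10_000, 50)).astype(np.float32)   # stand-in for embedding rows

    # Store one byte per row (a codebook index) instead of 50 float32s per row.
    kmeans = KMeans(n_clusters=256, n_init=1, random_state=0).fit(weights)
    codes = kmeans.labels_.astype(np.uint8)    # compressed representation: 10,000 bytes
    codebook = kmeans.cluster_centers_         # plus a 256 x 50 float codebook
    approx = codebook[codes]                   # lossy reconstruction at load time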

Handling unseen words

  • Bag-of-words limits the model to words seen in training (hashing blurs this a little, since unseen words still land in some bucket).
  • Learn character n-gram embeddings to handle morphology (e.g. "make", "making", "mako") and unknown words (sketched below).
  • Pre-initialise word embeddings with those learned from different corpora.

$\rightarrow$ More robust models
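
A sketch of the subword idea; fastText's defaults are (roughly) 3- to 6-character n-grams of the word wrapped in angle brackets:

    def char_ngrams(word, n_min=3, n_max=6):
        # Wrap the word in < > and emit every character n-gram.
        padded = '<' + word + '>'
        return [padded[i:i + n]
                for n in range(n_min, n_max + 1)
                for i in range(len(padded) - n + 1)]

    print(char_ngrams('make'))
    # ['<ma', 'mak', 'ake', 'ke>', '<mak', 'make', 'ake>', '<make', 'make>', '<make>']

An unseen form like "mako" still shares units such as '<ma' and 'mak' with "make" and "making", so it gets a usable vector instead of nothing.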

A toy task

$\rightarrow$ Given the title, predict which subreddit it belongs to.

Preprocessing

  • Take each post
  • Find the first sentence
  • Apply a really dumb tokeniser to split words !
  • Use the subreddit topic as the label
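
The tokenising and labelling steps might look roughly like this (tokenise and to_fasttext_line are made-up names, not the notebook's own; the __label__ prefix is fastText's actual supervised input format):

    import re

    def tokenise(text):
        # A really dumb tokeniser: lowercase, keep runs of word characters, split off punctuation.
        return re.findall(r'\w+|[^\w\s]', text.lower())

    def to_fasttext_line(subreddit, title):
        # One example per line, label first with the __label__ prefix.
        return '__label__{} {}'.format(subreddit, ' '.join(tokenise(title)))

    print(to_fasttext_line('ShittyCarMod', 'Maserati with an Aerodynamic Spoiler!'))
    # __label__ShittyCarMod maserati with an aerodynamic spoiler !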

Streaming through reddit submissions

  • First 50K used to find the top-20 common labels (plus other), and for training embeddings
  • Next 10K for training
  • Next 10K for development
  • Next 1K for test
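
Roughly how the stream gets carved up (the function name and the (label, title) pair layout are assumptions; only the split sizes come from above):

    from itertools import islice

    def make_splits(submissions):
        # submissions: an iterator of (subreddit, title) pairs, read once, in order.
        head = list(islice(submissions, 50_000))    # label inventory + embedding pre-training
        train = list(islice(submissions, 10_000))
        dev = list(islice(submissions, 10_000))
        test = list(islice(submissions, 1_000))
        return head, train, dev, test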

Training the model

  • Wrap command-line fastText using some Python (sketched below)
  • Train and predict, returning:
    • Precision
    • Recall
    • F1
    • Training time
    • Model size
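
Not the notebook's actual wrapper, but a sketch of its shape, assuming the fasttext binary is on the PATH, training/test files in the __label__ format above, and micro-averaged scores (which would also explain why precision, recall and F1 coincide in the results later):

    import os
    import subprocess
    import time
    from sklearn.metrics import precision_recall_fscore_support

    def train_and_evaluate(train_path, test_path, test_labels, epoch=90, dim=50):
        # Train a supervised model from a __label__-formatted file.
        start = time.time()
        subprocess.run(['fasttext', 'supervised', '-input', train_path, '-output', 'model',
                        '-epoch', str(epoch), '-dim', str(dim)], check=True)
        train_time = time.time() - start

        # `fasttext predict` prints one __label__... prediction per input line.
        result = subprocess.run(['fasttext', 'predict', 'model.bin', test_path],
                                capture_output=True, text=True, check=True)
        predictions = result.stdout.split()

        p, r, f, _ = precision_recall_fscore_support(test_labels, predictions, average='micro')
        return {'p': p, 'r': r, 'f': f,
                'time': train_time,
                'size': os.path.getsize('model.bin') / 2 ** 20}   # MiB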

Sanity-check baseline

  • Using scikit-learn, nothing fancy
  • Bag of words (unigrams and bigrams)
  • Logistic regression prediction
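
A minimal version of that baseline (the C=10 setting comes from the results slide; the toy fit call is just to make the snippet self-contained):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Bag of unigrams and bigrams feeding a plain logistic regression.
    baseline = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),
        LogisticRegression(C=10),
    )

    # Tiny stand-in data; the notebook uses the titles and subreddit labels from the splits above.
    baseline.fit(['Maserati with an Aerodynamic Spoiler', 'me irl'], ['ShittyCarMod', 'me_irl'])
    print(baseline.predict(['me irl again']))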

Data

In [31]:
data_stats = pd.DataFrame([{'Label': k, 'Count': v} for k, v in 
              Counter((i[0] for i in train)).items()]).sort_values('Count', ascending=False)
data_stats[:10]
Out[31]:
    Count  Label
16   1700  AutoNewspaper
11   1248  AskReddit
1    1085  GlobalOffensiveTrade
6     740  news
4     691  RocketLeagueExchange
7     637  newsbotbot
17    473  The_Donald
5     472  business
15    459  videos
8     389  Showerthoughts
In [32]:
data_stats[10:]
Out[32]:
    Count  Label
2     349  funny
3     278  me_irl
9     272  gameofthrones
14    262  gaming
13    225  aww
10    207  pics
18    172  Overwatch
19    151  MemezForDayz
12    140  gonewild
20     49  Ice_Poseidon
0       1  ShittyCarMod
In [33]:
data_df = pd.DataFrame([{'Text': t, 'Label': l} for l, t in train[:10]])
data_df
Out[33]:
   Label                 Text
0  ShittyCarMod          Maserati with an Aerodynamic Spoiler
1  GlobalOffensiveTrade  [Store] Double moses M9 #355, Dual side blue g...
2  funny                 Adam Mišík – Tak pojď
3  me_irl                me 🌞irl
4  RocketLeagueExchange  [ps4] [h] crimsom mantis [w] offers key prefer
5  business              Edible Prints on Cake
6  funny                 RT @naukhaizs: #pakistan #karachi #lahore #isl...
7  news                  10 ảnh avatar buồn , thất tình , tâm trạng dàn...
8  newsbotbot            @washingtonpost: For Facebook , erasing hate s...
9  Showerthoughts        Russia vs USA

Results

Comparing models trained on train, hyperparameters tuned on development and evaluated on test:

  • baseline model with L2 regularisation, C=10
  • fastText model with epoch=90, dimensions=50
  • fastText-quantized model as above but quantized
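
For the quantized row the extra step is fastText's quantize mode; roughly like this (paths are placeholders and the exact flags vary between fastText releases, so treat it as illustrative):

    import subprocess

    # Compress the already-trained model.bin; this writes model.ftz alongside it.
    subprocess.run(['fasttext', 'quantize', '-input', 'train.txt', '-output', 'model'], check=True)

    # The compressed model is used exactly like the original one.
    subprocess.run(['fasttext', 'predict', 'model.ftz', 'test.txt'], check=True)
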
In [13]:
df
Out[13]:
   name                time       size        p      r      f
0  baseline            21.397510  176.164019  0.692  0.692  0.692
1  fastText             6.789109   64.886967  0.689  0.689  0.689
2  fastText-quantized  54.206348    3.623718  0.702  0.702  0.702

Size vs Time

  • Left-to-right: baseline, fastText, fastText-quantized
  • fastText smaller and quicker to train than baseline
  • fastText-quantized smaller again, but much slower to train
In [15]:
plot(df)

Performance

  • very similar performance
  • quantisation makes little impact on F1 score

$\rightarrow$ A significant difference? Probably not...
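
One way to check that hunch (not done here) is a paired bootstrap over the 1K test titles; a sketch, assuming per-example correctness vectors for any two of the models:

    import numpy as np

    def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
        # correct_a / correct_b: one boolean per test title, True if that model got it right.
        rng = np.random.default_rng(seed)
        a, b = np.asarray(correct_a), np.asarray(correct_b)
        wins = 0
        for _ in range(n_resamples):
            idx = rng.integers(0, len(a), len(a))   # resample the test set with replacement
            wins += a[idx].mean() > b[idx].mean()
        return wins / n_resamples                    # how often model A beats model B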

In [16]:
df.plot(kind='bar', x='name', y=['p', 'r', 'f'], ylim=(0, 1), figsize=(20, 8))
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x11035d8d0>

What I didn't look at here

  • Sub-word embeddings for performance on unknown words.
  • Not using hierarchical softmax to give slower, but better models.
  • Pre-initialising with other word embeddings.

Conclusions

fastText is worth trying.

  • Works relatively well for text (see also *Sem "embed all the things")
  • Can have small model sizes (for mobile or serverless platforms)
  • Trains quickly enough to tune and spend time getting more data

Thanks!