Concerted programme of research from Facebook towards practical text classification. Three key papers from 2016:
`d`-dimensional word embeddings (à la word2vec) smushed together with averaging. $\rightarrow$ Faster and hopefully better than bag-of-words
>>> tokens('the cat sat on the mat')
['the', 'cat', 'sat', 'on', 'the', 'mat']
>>> ngrams('the cat sat on the mat')
['START_the', 'the_cat', 'cat_sat', 'sat_on', 'on_the', 'the_mat', 'mat_END']
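A toy sketch of the averaging idea above (not the fastText internals): every token/n-gram gets a `d`-dimensional vector, the vectors for a title are averaged into one feature vector, and a linear classifier scores it. All vectors here are random placeholders.

```python
# Toy sketch: average word vectors, then a linear classifier on top.
import numpy as np

d, n_labels = 4, 3
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=d) for w in ['the', 'cat', 'sat', 'on', 'mat']}
W, b = rng.normal(size=(n_labels, d)), np.zeros(n_labels)  # classifier weights

def embed(tokens):
    # mean of the tokens' vectors -> one d-dimensional feature vector
    return np.mean([embedding[t] for t in tokens if t in embedding], axis=0)

def predict(tokens):
    return int(np.argmax(W @ embed(tokens) + b))

predict(['the', 'cat', 'sat', 'on', 'the', 'mat'])
```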
But: mapping each n-gram (e.g. `sat_on`) to a feature id (e.g. `feature_12345`) needs a dictionary of every n-gram seen, which costs memory and slows multicore learning. The hashing trick (see also vowpal wabbit) maps n-grams into a constrained memory space.
$\rightarrow$ More memory-efficient and parallelisable learning
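A minimal sketch of the hashing trick: rather than keeping a dictionary from n-gram to feature id, hash each n-gram straight into one of a fixed number of buckets (Python's built-in `hash()` stands in for the real hash function; fastText and vowpal wabbit use their own). Collisions happen, but with enough buckets they cost little accuracy.

```python
# Hashing trick sketch: fixed feature space, no n-gram dictionary to store.
N_BUCKETS = 2 ** 20

def hashed_features(ngrams, n_buckets=N_BUCKETS):
    return [hash(ng) % n_buckets for ng in ngrams]

hashed_features(['START_the', 'the_cat', 'cat_sat', 'sat_on'])
```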
$\rightarrow$ Faster with some accuracy cost
$\rightarrow$ Smaller models at some accuracy cost
$\rightarrow$ More robust models
The data is `json`: each submission has a `title` (the title of the submission) and a `subreddit` (the subreddit it was submitted to). $\rightarrow$ Given the `title`, predict which `subreddit` it belongs to.
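A sketch of loading the data, assuming one JSON object per line with `title` and `subreddit` fields; the file name is hypothetical. It yields the `(label, text)` pairs used as `train` below.

```python
# Load (subreddit, title) pairs from a JSON-lines file (file name made up).
import json

def load(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            yield record['subreddit'], record['title']

train = list(load('reddit_train.jsonl'))  # hypothetical path
```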
A really dumb tokeniser to split words (on `!` and other punctuation); `fastText` for classification and for training embeddings; some python / `scikit-learn` for the rest, nothing fancy.
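A sketch of the "really dumb" tokeniser and the bigram helper shown earlier; the exact splitting rules are an assumption, but the output matches the `tokens`/`ngrams` example above.

```python
# Dumb tokeniser: lowercase, split on anything that isn't a word character.
import re

def tokens(text):
    return [t for t in re.split(r"[^\w']+", text.lower()) if t]

def ngrams(text):
    padded = ['START'] + tokens(text) + ['END']
    return ['_'.join(pair) for pair in zip(padded, padded[1:])]

ngrams('the cat sat on the mat')
# ['START_the', 'the_cat', 'cat_sat', 'sat_on', 'on_the', 'the_mat', 'mat_END']
```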
from collections import Counter
import pandas as pd

# How many training examples are there per subreddit label?
data_stats = pd.DataFrame([{'Label': k, 'Count': v} for k, v in
                           Counter(i[0] for i in train).items()]).sort_values('Count', ascending=False)
data_stats[:10]
| | Count | Label |
|---|---|---|
| 16 | 1700 | AutoNewspaper |
| 11 | 1248 | AskReddit |
| 1 | 1085 | GlobalOffensiveTrade |
| 6 | 740 | news |
| 4 | 691 | RocketLeagueExchange |
| 7 | 637 | newsbotbot |
| 17 | 473 | The_Donald |
| 5 | 472 | business |
| 15 | 459 | videos |
| 8 | 389 | Showerthoughts |
data_stats[10:]
| | Count | Label |
|---|---|---|
| 2 | 349 | funny |
| 3 | 278 | me_irl |
| 9 | 272 | gameofthrones |
| 14 | 262 | gaming |
| 13 | 225 | aww |
| 10 | 207 | pics |
| 18 | 172 | Overwatch |
| 19 | 151 | MemezForDayz |
| 12 | 140 | gonewild |
| 20 | 49 | Ice_Poseidon |
| 0 | 1 | ShittyCarMod |
# Peek at the first ten (label, title) pairs
data_df = pd.DataFrame([{'Text': t, 'Label': l} for l, t in train[:10]])
data_df
| | Label | Text |
|---|---|---|
| 0 | ShittyCarMod | Maserati with an Aerodynamic Spoiler |
| 1 | GlobalOffensiveTrade | [Store] Double moses M9 #355, Dual side blue g... |
| 2 | funny | Adam Mišík – Tak pojď |
| 3 | me_irl | me 🌞irl |
| 4 | RocketLeagueExchange | [ps4] [h] crimsom mantis [w] offers key prefer |
| 5 | business | Edible Prints on Cake |
| 6 | funny | RT @naukhaizs: #pakistan #karachi #lahore #isl... |
| 7 | news | 10 ảnh avatar buồn , thất tình , tâm trạng dàn... |
| 8 | newsbotbot | @washingtonpost: For Facebook , erasing hate s... |
| 9 | Showerthoughts | Russia vs USA |
Comparing models trained on `train`, hyperparameters tuned on `development` and evaluated on `test`:

- `baseline`: model with L2 regularisation, C=10
- `fastText`: model with epoch=90, dimensions=50
- `fastText-quantized`: model as above, but quantized (a rough training sketch for all three follows the results table)

df
| | name | time | size | p | r | f |
|---|---|---|---|---|---|---|
| 0 | baseline | 21.397510 | 176.164019 | 0.692 | 0.692 | 0.692 |
| 1 | fastText | 6.789109 | 64.886967 | 0.689 | 0.689 | 0.689 |
| 2 | fastText-quantized | 54.206348 | 3.623718 | 0.702 | 0.702 | 0.702 |
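The numbers above come from models along these lines. A rough sketch, not the post's actual training code: the choice of estimator for `baseline`, the feature pipeline and the file names are assumptions; the hyperparameters are the ones listed above.

```python
# Sketch only: baseline as hashed bag-of-n-grams + logistic regression
# (one plausible reading of "L2 regularisation, C=10"), plus the two
# fastText variants via the official `fasttext` python bindings.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import fasttext

labels = [l for l, _ in train]
texts = [t for _, t in train]

baseline = make_pipeline(HashingVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(penalty='l2', C=10))
baseline.fit(texts, labels)

# fastText expects one '__label__<subreddit> <title>' line per example
with open('train.txt', 'w', encoding='utf-8') as f:
    for label, text in train:
        f.write(f'__label__{label} {text}\n')

ft = fasttext.train_supervised(input='train.txt', epoch=90, dim=50)

# fastText-quantized: the same model, compressed with product quantisation
ft.quantize(input='train.txt', retrain=True)
ft.save_model('reddit.ftz')
```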
Comparing `baseline`, `fastText` and `fastText-quantized`: `fastText` is smaller and quicker to train than `baseline`; `fastText-quantized` is smaller again, but much slower to train.

plot(df)
Is there a significant difference? $\rightarrow$ probably not...
df.plot(kind='bar', x='name', y=['p', 'r', 'f'], ylim=(0, 1), figsize=(20, 8))
[bar chart of p, r and f for the three models]
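Is the gap between 0.689 and 0.702 meaningful? A back-of-the-envelope check (not from the post): normal-approximation 95% intervals for each model's accuracy. `N_TEST` is a made-up test-set size; plug in the real one. If the intervals overlap heavily, the difference is probably noise.

```python
# Rough normal-approximation 95% confidence intervals for accuracy.
# N_TEST is hypothetical -- substitute the actual test-set size.
import math

N_TEST = 2000

def ci(acc, n=N_TEST):
    half = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return round(acc - half, 3), round(acc + half, 3)

for name, acc in [('baseline', 0.692), ('fastText', 0.689), ('fastText-quantized', 0.702)]:
    print(name, ci(acc))
```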
`fastText` is worth trying.
*Sem ("embed all the things")