Concerted programme of research from Facebook towards practical text classification. Three key papers from 2016:
`d`-dimensional word embeddings (à la word2vec) smushed together with averaging. $\rightarrow$ Faster and hopefully better than bag-of-words
>>> tokens('the cat sat on the mat')
['the', 'cat', 'sat', 'on', 'the', 'mat']
>>> ngrams('the cat sat on the mat')
['START_the', 'the_cat', 'cat_sat', 'sat_on', 'on_the', 'the_mat', 'mat_END']
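A toy sketch of the averaging idea above (not the fastText internals): every token/n-gram gets a `d`-dimensional vector, the vectors for a title are averaged into one feature vector, and a linear classifier scores it. All vectors here are random placeholders.

```python
# Toy sketch: average word vectors, then a linear classifier on top.
import numpy as np

d, n_labels = 4, 3
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=d) for w in ['the', 'cat', 'sat', 'on', 'mat']}
W, b = rng.normal(size=(n_labels, d)), np.zeros(n_labels)  # classifier weights

def embed(tokens):
    # mean of the tokens' vectors -> one d-dimensional feature vector
    return np.mean([embedding[t] for t in tokens if t in embedding], axis=0)

def predict(tokens):
    return int(np.argmax(W @ embed(tokens) + b))

predict(['the', 'cat', 'sat', 'on', 'the', 'mat'])
```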
But: mapping each n-gram (e.g. `sat_on`) to a feature id (e.g. `feature_12345`) needs a dictionary of every n-gram seen, which costs memory and slows multicore learning. The hashing trick (see also vowpal wabbit) maps n-grams into a constrained memory space.
$\rightarrow$ More memory-efficient and parallelisable learning
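A minimal sketch of the hashing trick: rather than keeping a dictionary from n-gram to feature id, hash each n-gram straight into one of a fixed number of buckets (Python's built-in `hash()` stands in for the real hash function; fastText and vowpal wabbit use their own). Collisions happen, but with enough buckets they cost little accuracy.

```python
# Hashing trick sketch: fixed feature space, no n-gram dictionary to store.
N_BUCKETS = 2 ** 20

def hashed_features(ngrams, n_buckets=N_BUCKETS):
    return [hash(ng) % n_buckets for ng in ngrams]

hashed_features(['START_the', 'the_cat', 'cat_sat', 'sat_on'])
```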
$\rightarrow$ Faster with some accuracy cost
$\rightarrow$ Smaller models at some accuracy cost
$\rightarrow$ More robust models
The data is `json`: each submission has a `title` (the title of the submission) and a `subreddit` (the subreddit it was submitted to). $\rightarrow$ Given the `title`, predict which `subreddit` it belongs to.
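A sketch of loading the data, assuming one JSON object per line with `title` and `subreddit` fields; the file name is hypothetical. It yields the `(label, text)` pairs used as `train` below.

```python
# Load (subreddit, title) pairs from a JSON-lines file (file name made up).
import json

def load(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            yield record['subreddit'], record['title']

train = list(load('reddit_train.jsonl'))  # hypothetical path
```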
A really dumb tokeniser to split words (on `!` and other punctuation); `fastText` for classification and for training embeddings; some python / `scikit-learn` for the rest, nothing fancy.
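A sketch of the "really dumb" tokeniser and the bigram helper shown earlier; the exact splitting rules are an assumption, but the output matches the `tokens`/`ngrams` example above.

```python
# Dumb tokeniser: lowercase, split on anything that isn't a word character.
import re

def tokens(text):
    return [t for t in re.split(r"[^\w']+", text.lower()) if t]

def ngrams(text):
    padded = ['START'] + tokens(text) + ['END']
    return ['_'.join(pair) for pair in zip(padded, padded[1:])]

ngrams('the cat sat on the mat')
# ['START_the', 'the_cat', 'cat_sat', 'sat_on', 'on_the', 'the_mat', 'mat_END']
```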
from collections import Counter
import pandas as pd

# How many training examples are there per subreddit label?
data_stats = pd.DataFrame([{'Label': k, 'Count': v} for k, v in
                           Counter(i[0] for i in train).items()]).sort_values('Count', ascending=False)
data_stats[:10]
| | Count | Label |
|---|---|---|
| 16 | 1700 | AutoNewspaper |
| 11 | 1248 | AskReddit |
| 1 | 1085 | GlobalOffensiveTrade |
| 6 | 740 | news |
| 4 | 691 | RocketLeagueExchange |
| 7 | 637 | newsbotbot |
| 17 | 473 | The_Donald |
| 5 | 472 | business |
| 15 | 459 | videos |
| 8 | 389 | Showerthoughts |
data_stats[10:]
| | Count | Label |
|---|---|---|
| 2 | 349 | funny |
| 3 | 278 | me_irl |
| 9 | 272 | gameofthrones |
| 14 | 262 | gaming |
| 13 | 225 | aww |
| 10 | 207 | pics |
| 18 | 172 | Overwatch |
| 19 | 151 | MemezForDayz |
| 12 | 140 | gonewild |
| 20 | 49 | Ice_Poseidon |
| 0 | 1 | ShittyCarMod |
# Peek at the first ten (label, title) pairs
data_df = pd.DataFrame([{'Text': t, 'Label': l} for l, t in train[:10]])
data_df
| | Label | Text |
|---|---|---|
| 0 | ShittyCarMod | Maserati with an Aerodynamic Spoiler |
| 1 | GlobalOffensiveTrade | [Store] Double moses M9 #355, Dual side blue g... |
| 2 | funny | Adam Mišík – Tak pojď |
| 3 | me_irl | me 🌞irl |
| 4 | RocketLeagueExchange | [ps4] [h] crimsom mantis [w] offers key prefer |
| 5 | business | Edible Prints on Cake |
| 6 | funny | RT @naukhaizs: #pakistan #karachi #lahore #isl... |
| 7 | news | 10 ảnh avatar buồn , thất tình , tâm trạng dàn... |
| 8 | newsbotbot | @washingtonpost: For Facebook , erasing hate s... |
| 9 | Showerthoughts | Russia vs USA |
Comparing models trained on `train`, hyperparameters tuned on `development` and evaluated on `test`:

- `baseline`: model with L2 regularisation, C=10
- `fastText`: model with epoch=90, dimensions=50
- `fastText-quantized`: model as above, but quantized (a rough training sketch for all three follows the results table)

df
| | name | time | size | p | r | f |
|---|---|---|---|---|---|---|
| 0 | baseline | 21.397510 | 176.164019 | 0.692 | 0.692 | 0.692 |
| 1 | fastText | 6.789109 | 64.886967 | 0.689 | 0.689 | 0.689 |
| 2 | fastText-quantized | 54.206348 | 3.623718 | 0.702 | 0.702 | 0.702 |
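The numbers above come from models along these lines. A rough sketch, not the post's actual training code: the choice of estimator for `baseline`, the feature pipeline and the file names are assumptions; the hyperparameters are the ones listed above.

```python
# Sketch only: baseline as hashed bag-of-n-grams + logistic regression
# (one plausible reading of "L2 regularisation, C=10"), plus the two
# fastText variants via the official `fasttext` python bindings.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import fasttext

labels = [l for l, _ in train]
texts = [t for _, t in train]

baseline = make_pipeline(HashingVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(penalty='l2', C=10))
baseline.fit(texts, labels)

# fastText expects one '__label__<subreddit> <title>' line per example
with open('train.txt', 'w', encoding='utf-8') as f:
    for label, text in train:
        f.write(f'__label__{label} {text}\n')

ft = fasttext.train_supervised(input='train.txt', epoch=90, dim=50)

# fastText-quantized: the same model, compressed with product quantisation
ft.quantize(input='train.txt', retrain=True)
ft.save_model('reddit.ftz')
```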
Comparing `baseline`, `fastText` and `fastText-quantized`: `fastText` is smaller and quicker to train than `baseline`; `fastText-quantized` is smaller again, but much slower to train.

plot(df)
Is there a significant difference? $\rightarrow$ probably not...
df.plot(kind='bar', x='name', y=['p', 'r', 'f'], ylim=(0, 1), figsize=(20, 8))
[bar chart of p, r and f for the three models]
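Is the gap between 0.689 and 0.702 meaningful? A back-of-the-envelope check (not from the post): normal-approximation 95% intervals for each model's accuracy. `N_TEST` is a made-up test-set size; plug in the real one. If the intervals overlap heavily, the difference is probably noise.

```python
# Rough normal-approximation 95% confidence intervals for accuracy.
# N_TEST is hypothetical -- substitute the actual test-set size.
import math

N_TEST = 2000

def ci(acc, n=N_TEST):
    half = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return round(acc - half, 3), round(acc + half, 3)

for name, acc in [('baseline', 0.692), ('fastText', 0.689), ('fastText-quantized', 0.702)]:
    print(name, ci(acc))
```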
`fastText` is worth trying.
*Sem ("embed all the things")