Concerted programme of research from Facebook towards practical text classification. Three key papers from 2016:

- Enriching Word Vectors with Subword Information (Bojanowski, Grave, Joulin & Mikolov)
- Bag of Tricks for Efficient Text Classification (Joulin, Grave, Bojanowski & Mikolov)
- FastText.zip: Compressing text classification models (Joulin, Grave, Bojanowski, Douze, Jégou & Mikolov)

d-dimensional word embeddings (à la word2vec) smushed together by averaging
$\rightarrow$ Faster and hopefully better than bag-of-words
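A minimal sketch of that averaging step, assuming a plain dict of pre-trained d-dimensional vectors (the names here are hypothetical, not fastText's API):

import numpy as np

def doc_vector(tokens, embeddings, d=50):
    # average the embeddings of the tokens we actually have vectors for
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(d)

The resulting document vector is fed to a simple linear classifier. Word n-grams can be added as extra features: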
>>> tokens('the cat sat on the mat')
['the', 'cat', 'sat', 'on', 'the', 'mat']
>>> ngrams('the cat sat on the mat')
['START_the', 'the_cat', 'cat_sat', 'sat_on', 'on_the', 'the_mat', 'mat_END']
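The tokens and ngrams helpers above might look something like this (a sketch; the exact tokenisation rules are an assumption):

import re

def tokens(text):
    # keep runs of word characters, lowercased
    return re.findall(r'\w+', text.lower())

def ngrams(text, n=2):
    # pad with START/END markers and join adjacent tokens into n-gram features
    padded = ['START'] + tokens(text) + ['END']
    return ['_'.join(padded[i:i + n]) for i in range(len(padded) - n + 1)]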
But:
an explicit dictionary maps each n-gram to a feature index (sat_on to feature_12345, say), which costs memory and slows multicore learning. The hashing trick (see also vowpal wabbit) maps n-grams into a constrained memory space instead; a sketch follows the list below.
$\rightarrow$ More memory-efficient and parallelisable learning
$\rightarrow$ Faster with some accuracy cost
$\rightarrow$ Smaller models at some accuracy cost
$\rightarrow$ More robust models
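A minimal sketch of the hashing trick, assuming a fixed bucket count and a cheap stable hash (both are illustrative choices, not fastText's exact ones):

import zlib

BUCKETS = 2 ** 20  # fixed feature space: collisions are traded for bounded memory

def feature_index(ngram, buckets=BUCKETS):
    # hash the n-gram string straight to a bucket, no vocabulary to store or share
    return zlib.crc32(ngram.encode('utf-8')) % buckets

feature_index('sat_on')  # always lands in the same bucket, with no dictionary kept in memory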
The data is JSON, one record per Reddit submission:

- title: the title of the submission
- subreddit: the subreddit it was submitted to

$\rightarrow$ Given the title, predict which subreddit it belongs to.
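For example, one record looks roughly like this (the values are taken from the sample rows shown further down):

{"subreddit": "business", "title": "Edible Prints on Cake"}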
Tools:

- a really dumb tokeniser to split words from everything else
- fastText for training the embeddings and classifiers
- some Python and scikit-learn for the rest, nothing fancy

Counting examples per label:

from collections import Counter
import pandas as pd

data_stats = pd.DataFrame([{'Label': k, 'Count': v} for k, v in
                           Counter((i[0] for i in train)).items()]).sort_values('Count', ascending=False)
data_stats[:10]
|   | Count | Label |
|---|---|---|
| 16 | 1700 | AutoNewspaper |
| 11 | 1248 | AskReddit |
| 1 | 1085 | GlobalOffensiveTrade |
| 6 | 740 | news |
| 4 | 691 | RocketLeagueExchange |
| 7 | 637 | newsbotbot |
| 17 | 473 | The_Donald |
| 5 | 472 | business |
| 15 | 459 | videos |
| 8 | 389 | Showerthoughts |
data_stats[10:]
|   | Count | Label |
|---|---|---|
| 2 | 349 | funny |
| 3 | 278 | me_irl |
| 9 | 272 | gameofthrones |
| 14 | 262 | gaming |
| 13 | 225 | aww |
| 10 | 207 | pics |
| 18 | 172 | Overwatch |
| 19 | 151 | MemezForDayz |
| 12 | 140 | gonewild |
| 20 | 49 | Ice_Poseidon |
| 0 | 1 | ShittyCarMod |
data_df = pd.DataFrame([{'Text': t, 'Label': l} for l, t in train[:10]])
data_df
|   | Label | Text |
|---|---|---|
| 0 | ShittyCarMod | Maserati with an Aerodynamic Spoiler |
| 1 | GlobalOffensiveTrade | [Store] Double moses M9 #355, Dual side blue g... |
| 2 | funny | Adam Mišík – Tak pojď |
| 3 | me_irl | me 🌞irl |
| 4 | RocketLeagueExchange | [ps4] [h] crimsom mantis [w] offers key prefer |
| 5 | business | Edible Prints on Cake |
| 6 | funny | RT @naukhaizs: #pakistan #karachi #lahore #isl... |
| 7 | news | 10 ảnh avatar buồn , thất tình , tâm trạng dàn... |
| 8 | newsbotbot | @washingtonpost: For Facebook , erasing hate s... |
| 9 | Showerthoughts | Russia vs USA |
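fastText's supervised mode reads a plain-text file with one example per line and the class marked by a __label__ prefix. A minimal sketch of writing the training split in that format (the file name and the reuse of the tokens helper are assumptions):

with open('train.txt', 'w') as f:
    for label, title in train:
        f.write('__label__{} {}\n'.format(label, ' '.join(tokens(title))))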
Comparing models trained on train, with hyperparameters tuned on development, and evaluated on test:
Three models (a hedged training sketch follows the results table below):

- baseline: model with L2 regularisation, C=10
- fastText: trained with epoch=90, dim=50
- fastText-quantized: as above, but quantized

df
|   | name | time | size | p | r | f |
|---|---|---|---|---|---|---|
| 0 | baseline | 21.397510 | 176.164019 | 0.692 | 0.692 | 0.692 |
| 1 | fastText | 6.789109 | 64.886967 | 0.689 | 0.689 | 0.689 |
| 2 | fastText-quantized | 54.206348 | 3.623718 | 0.702 | 0.702 | 0.702 |
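A sketch of how these three could be produced, assuming the fasttext Python bindings and scikit-learn; the vectoriser, estimator and file names are assumptions, only the hyperparameters above come from the experiment:

import fasttext
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# baseline: bag-of-words features into a linear model with L2 regularisation, C=10
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(C=10))
baseline.fit([t for _, t in train], [l for l, _ in train])

# fastText: averaged embeddings trained end-to-end on the file written earlier
ft = fasttext.train_supervised('train.txt', epoch=90, dim=50)

# fastText-quantized: the same model compressed with product quantisation
ft.quantize(input='train.txt')
ft.save_model('reddit.ftz')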
Comparing baseline, fastText and fastText-quantized:

- fastText is smaller and quicker to train than the baseline
- fastText-quantized is smaller again, but much slower to train

plot(df)
$\rightarrow$ Is the difference in p/r/f significant? Probably not...
df.plot(kind='bar', x='name', y=['p', 'r', 'f'], ylim=(0, 1), figsize=(20, 8))
(bar chart of p, r and f for each model)
fastText is worth trying.
*Sem "embed all the things")