Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942. 2019 Sep 26.
Summary
ALBERT modifies [BERT] by sharing parameters across layers and factorizing the embedding layer into two smaller matrices, and by trading next sentence prediction (NSP) for sentence order prediction (SOP), following findings from Yang et al. 2019 (XLNet) and Liu et al. 2019 (RoBERTa) that NSP contributes little.
Model Name | Parameters | Layers | Hidden | Embedding |
---|---|---|---|---|
ALBERT-base | 12M | 12 | 768 | 128 |
ALBERT-large | 18M | 24 | 1024 | 128 |
ALBERT-xlarge | 60M | 24 | 2048 | 128 |
ALBERT-xxlarge | 235M | 12 | 4096 | 128 |
Note that, due to parameter sharing in ALBERT, the parameter count is decoupled from the effective computation cost: sharing reduces memory requirements, but a forward pass still runs through all $L$ layers.
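To make that decoupling concrete, here is a minimal PyTorch-style sketch (not the official implementation; it omits position/segment embeddings, GELU, attention masks and the pretraining heads) of the two structural changes: a factorized $V \times E$ + $E \times H$ embedding, and a single encoder layer whose weights are reused for all $L$ layers.

```python
import torch
import torch.nn as nn

class AlbertSketch(nn.Module):
    """Minimal sketch of ALBERT's two structural changes (illustrative only)."""

    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=768,
                 num_layers=12, num_heads=12):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_size)   # V x E lookup
        self.embed_proj = nn.Linear(embed_size, hidden_size)      # E x H projection
        # One shared encoder layer instead of num_layers independent ones.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=4 * hidden_size, batch_first=True)
        self.num_layers = num_layers

    def forward(self, input_ids):
        x = self.embed_proj(self.token_embed(input_ids))
        for _ in range(self.num_layers):   # same weights applied L times: fewer
            x = self.shared_layer(x)       # parameters, but unchanged compute
        return x

model = AlbertSketch()
out = model(torch.randint(0, 30000, (2, 16)))
# ~11M parameters here vs ~89M if the 12 layers were independent
# (the paper's 12M figure also counts pieces omitted from this sketch).
print(sum(p.numel() for p in model.parameters()), out.shape)
```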
Formula used to estimate compute parameters (i.e., counting each of the $L$ layers separately, ignoring sharing; checked in the sketch after the table below):
\[V \times E + E \times H + L \times (4H^2 + 2 \times 4H^2) + H^2\]
where $V$, $E$, $H$ and $L$ are the vocabulary size, embedding size, hidden size and number of layers respectively.
Model | Compute Parameters | Pretraining steps | Tokens per batch | Normalized computation cost (Parameters x steps x tokens per batch) |
---|---|---|---|---|
ALBERT-base | $89 \times 10^6$ | $125,000$ | $4096 \times 512$ | $2.333 \times 10^{19}$ |
ALBERT-large | $307 \times 10^6$ | $125,000$ | $4096 \times 512$ | $8.048 \times 10^{19}$ |
ALBERT-xlarge | $1216 \times 10^6$ | $125,000$ | $4096 \times 512$ | $3.188 \times 10^{20}$ |
ALBERT-xxlarge | $2437 \times 10^6$ | $125,000$ | $4096 \times 512$ | $6.388 \times 10^{20}$ |
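The two computed columns above can be reproduced directly from the formula; a small Python check (constants hard-coded from the tables above; results match to within the rounding used in the table):

```python
# Reproduce the "compute parameters" and normalized-cost columns from the formula above.
V = 30000                      # vocab size
configs = {                    # name: (E, H, L)
    "base":    (128,  768, 12),
    "large":   (128, 1024, 24),
    "xlarge":  (128, 2048, 24),
    "xxlarge": (128, 4096, 12),
}
steps = 125_000
tokens_per_batch = 4096 * 512

for name, (E, H, L) in configs.items():
    compute_params = V * E + E * H + L * (4 * H**2 + 2 * 4 * H**2) + H**2
    cost = compute_params * steps * tokens_per_batch
    print(f"ALBERT-{name}: {compute_params / 1e6:.0f}M compute params, cost {cost:.3e}")
```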
Architecture Details
Tokenization
SentencePiece, vocabulary size 30,000.
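As a hedged illustration of how such a tokenizer could be built and used with the sentencepiece library (corpus and file names are placeholders, not the paper's actual artifacts):

```python
import sentencepiece as spm

# Train a 30k-piece model on a plain-text corpus (hypothetical file names).
spm.SentencePieceTrainer.train(
    input="pretraining_corpus.txt",
    model_prefix="albert_sp",
    vocab_size=30000,
)

sp = spm.SentencePieceProcessor(model_file="albert_sp.model")
print(sp.encode("ALBERT shares parameters across layers.", out_type=str))
```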
Embeddings for multi-segment tasks
Standard BERT-style segment embeddings.
Positional Encoding
Learnable Absolute Positional Embeddings.
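A minimal sketch (assuming BERT-style position handling with a maximum length of 512): learned absolute positions are just another embedding table indexed by position id and added to the token (and segment) embeddings.

```python
import torch
import torch.nn as nn

max_len, hidden = 512, 768
position_embed = nn.Embedding(max_len, hidden)   # one learned vector per position

seq_len = 128
position_ids = torch.arange(seq_len)             # 0, 1, ..., seq_len - 1
pos_vectors = position_embed(position_ids)       # added to token/segment embeddings
print(pos_vectors.shape)                         # torch.Size([128, 768])
```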
Attention Mechanism
Multi-head scaled dot-product attention.
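A generic sketch of scaled dot-product attention over pre-split heads (not ALBERT-specific code; ALBERT additionally shares these weights across layers, as noted above):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq, head_dim). Illustrative sketch only."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)       # attention distribution per query
    return weights @ v

q = k = v = torch.randn(2, 12, 128, 64)   # batch=2, 12 heads, seq=128, head_dim=64
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                          # torch.Size([2, 12, 128, 64])
```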
Training Details
Pre-training Data
Wikipedia and BookCorpus; for the strongest reported results, also "the additional data used by XLNet and RoBERTa".
Pre-training Method
Masked Language Modelling: n-gram masking with a maximum n-gram length of three, which the authors attribute to Joshi et al. 2019 (SpanBERT).
- Choose 15% of tokens to mask as contiguous spans, with span lengths drawn from the n-gram masking distribution (see the sketch after this list).
- Of the chosen tokens, replace 80% with the special token [MASK], 10% with a random token from the vocabulary, and leave 10% unchanged.
- Train the model to predict the original tokens from the corrupted sequence via a cross-entropy loss.
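The paper gives the n-gram length distribution as $p(n) \propto 1/n$ with $n \le 3$; below is a minimal sketch of span selection under that distribution (helper names are illustrative, not from the paper's code).

```python
import random

MAX_N = 3
# p(n) proportional to 1/n for n = 1..3, as described in the ALBERT paper.
weights = [1.0 / n for n in range(1, MAX_N + 1)]

def choose_masked_positions(num_tokens, mask_rate=0.15):
    """Pick roughly 15% of positions as contiguous n-gram spans (illustrative sketch)."""
    budget = int(num_tokens * mask_rate)
    masked = set()
    while len(masked) < budget:
        n = random.choices(range(1, MAX_N + 1), weights=weights)[0]
        start = random.randrange(0, max(1, num_tokens - n + 1))
        masked.update(range(start, start + n))
    return sorted(masked)

print(choose_masked_positions(512)[:10])
```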
Sentence Order Prediction (SOP): NSP is dropped and replaced with sentence order prediction.
- Take two consecutive segments from the same document.
- Train the model to predict whether they appear in their original order or have been swapped (see the sketch below).
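A minimal sketch of how SOP training pairs could be built, assuming consecutive segments from the same document as the paper describes (the function name is illustrative):

```python
import random

def make_sop_example(segment_a, segment_b):
    """Given two consecutive text segments, return (first, second, label).

    label 1: segments kept in their original order (positive example)
    label 0: segments swapped (negative example), as in ALBERT's SOP objective
    """
    if random.random() < 0.5:
        return segment_a, segment_b, 1
    return segment_b, segment_a, 0

print(make_sop_example("The cat sat on the mat.", "Then it fell asleep."))
```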
Finetuning Data
SQuAD1.1, SQuAD2.0, MNLI, SST-2, RACE
Finetuning Method
Hyperparameters:
Dataset | Learning Rate | Batch Size | ALBERT Dropout Rate | Classifier Dropout Rate | Training Steps | Warmup Steps | Maximum Sequence Length |
---|---|---|---|---|---|---|---|
CoLA | $1 \times 10^{-5}$ | 16 | 0 | 0.1 | 5336 | 320 | 512 |
STS | $2 \times 10^{-5}$ | 16 | 0 | 0.1 | 3598 | 214 | 512 |
SST-2 | $1 \times 10^{-5}$ | 32 | 0 | 0.1 | 20935 | 1256 | 512 |
MNLI | $3 \times 10^{-5}$ | 128 | 0 | 0.1 | 10000 | 1000 | 512 |
QNLI | $1 \times 10^{-5}$ | 32 | 0 | 0.1 | 33112 | 1986 | 512 |
QQP | $5 \times 10^{-5}$ | 128 | 0.1 | 0.1 | 14000 | 1000 | 512 |
RTE | $3 \times 10^{-5}$ | 32 | 0.1 | 0.1 | 800 | 200 | 512 |
MRPC | $2 \times 10^{-5}$ | 32 | 0 | 0.1 | 800 | 200 | 512 |
WNLI | $2 \times 10^{-5}$ | 16 | 0.1 | 0.1 | 2000 | 250 | 512 |
SQuAD v1.1 | $5 \times 10^{-5}$ | 48 | 0 | 0.1 | 3649 | 365 | 384 |
SQuAD v2.0 | $3 \times 10^{-5}$ | 48 | 0 | 0.1 | 8144 | 814 | 512 |
RACE | $2 \times 10^{-5}$ | 32 | 0.1 | 0.1 | 12000 | 1000 | 512 |
Note that a wide range of hyperparameters has been tested and tuned here. Reportedly these hyperparameters were inspired by those of BERT, XLNet and RoBERTa, but the overlap is not clear. How much computation went into the hyperparameter search?
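As an illustration of how one row of the table translates into a training schedule (taking the MNLI row: learning rate $3 \times 10^{-5}$, 10,000 steps, 1,000 warmup steps), here is a hedged sketch with a linear warmup-then-decay schedule; the summary does not restate the exact fine-tuning optimizer, so AdamW is an assumption.

```python
import torch

model = torch.nn.Linear(768, 3)            # placeholder classifier head
total_steps, warmup_steps, peak_lr = 10_000, 1_000, 3e-5   # MNLI row above

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)  # AdamW is an assumption

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(5):                      # training-loop body omitted
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr())
```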
Evaluation
The median result over five runs is reported.
Evaluation Datasets
SQuAD1.1, SQuAD2.0, MNLI, SST-2, RACE
Evaluation Results
Model Name | Average | SQuAD1.1 | SQuAD2.0 | MNLI | SST-2 | RACE |
---|---|---|---|---|---|---|
ALBERT-base | 80.1 | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 |
ALBERT-large | 82.4 | 90.6/83.9 | 82.3/79.4 | 83.5 | 91.7 | 68.5 |
ALBERT-xlarge | 85.5 | 92.5/86.1 | 86.1/83.1 | 86.4 | 92.4 | 74.8 |
ALBERT-xxlarge | 91.0 | 94.8/89.3 | 90.2/87.4 | 90.8 | 96.9 | 86.5 |
Authors' Conclusions
- Although ALBERT has fewer parameters than BERT, it can easily be more computationally expensive.
- Sparse Attention (Child et al 2019) and Block Attention (Shen et al 2018) should be investigated.
- Try efficient language modelling (Yang et al 2019) and Hard Example Mining (Mikolov et al 2013).
- Sentence order prediction works well, but there may be many more self-supervised objectives worth exploring.