Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942. 2019 Sep 26.
Summary
ALBERT modifies [BERT] by sharing parameters across layers and factorizing the embedding layer into two smaller matrices, and by trading next sentence prediction (NSP) for sentence order prediction (SOP), following findings from Yang et al. 2019 (XLNet) and Liu et al. 2019 (RoBERTa) that NSP contributes little.
Model Name | Parameters | Layers | Hidden | Embedding |
---|---|---|---|---|
ALBERT-base | 12M | 12 | 768 | 128 |
ALBERT-large | 18M | 24 | 1024 | 128 |
ALBERT-xlarge | 60M | 24 | 2048 | 128 |
ALBERT-xxlarge | 235M | 12 | 4096 | 128 |
Note that, due to parameter sharing in ALBERT, the parameter count is decoupled from the effective computation cost: sharing reduces memory requirements, but a forward pass still runs through all $L$ layers.
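To make that decoupling concrete, here is a minimal PyTorch-style sketch (not the official implementation; it omits position/segment embeddings, GELU, attention masks and the pretraining heads) of the two structural changes: a factorized $V \times E$ + $E \times H$ embedding, and a single encoder layer whose weights are reused for all $L$ layers.

```python
import torch
import torch.nn as nn

class AlbertSketch(nn.Module):
    """Minimal sketch of ALBERT's two structural changes (illustrative only)."""

    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=768,
                 num_layers=12, num_heads=12):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_size)   # V x E lookup
        self.embed_proj = nn.Linear(embed_size, hidden_size)      # E x H projection
        # One shared encoder layer instead of num_layers independent ones.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=4 * hidden_size, batch_first=True)
        self.num_layers = num_layers

    def forward(self, input_ids):
        x = self.embed_proj(self.token_embed(input_ids))
        for _ in range(self.num_layers):   # same weights applied L times: fewer
            x = self.shared_layer(x)       # parameters, but unchanged compute
        return x

model = AlbertSketch()
out = model(torch.randint(0, 30000, (2, 16)))
# ~11M parameters here vs ~89M if the 12 layers were independent
# (the paper's 12M figure also counts pieces omitted from this sketch).
print(sum(p.numel() for p in model.parameters()), out.shape)
```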
Formula used to estimate compute parameters (i.e., counting each of the $L$ layers separately, ignoring sharing; checked in the sketch after the table below):
\[V \times E + E \times H + L \times (4H^2 + 2 \times 4H^2) + H^2\]
where $V$, $E$, $H$ and $L$ are the vocabulary size, embedding size, hidden size and number of layers respectively.
Model | Compute Parameters | Pretraining steps | Tokens per batch | Normalized computation cost (Parameters x steps x tokens per batch) |
---|---|---|---|---|
ALBERT-base | $89 \times 10^6$ | $125,000$ | $4096 \times 512$ | $2.333 \times 10^{19}$ |
ALBERT-large | $307 \times 10^6$ | $125,000$ | $4096 \times 512$ | $8.048 \times 10^{19}$ |
ALBERT-xlarge | $1216 \times 10^6$ | $125,000$ | $4096 \times 512$ | $3.188 \times 10^{20}$ |
ALBERT-xxlarge | $2437 \times 10^6$ | $125,000$ | $4096 \times 512$ | $6.388 \times 10^{20}$ |
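The two computed columns above can be reproduced directly from the formula; a small Python check (constants hard-coded from the tables above; results match to within the rounding used in the table):

```python
# Reproduce the "compute parameters" and normalized-cost columns from the formula above.
V = 30000                      # vocab size
configs = {                    # name: (E, H, L)
    "base":    (128,  768, 12),
    "large":   (128, 1024, 24),
    "xlarge":  (128, 2048, 24),
    "xxlarge": (128, 4096, 12),
}
steps = 125_000
tokens_per_batch = 4096 * 512

for name, (E, H, L) in configs.items():
    compute_params = V * E + E * H + L * (4 * H**2 + 2 * 4 * H**2) + H**2
    cost = compute_params * steps * tokens_per_batch
    print(f"ALBERT-{name}: {compute_params / 1e6:.0f}M compute params, cost {cost:.3e}")
```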
Architecture Details
Tokenization
SentencePiece, vocabulary size 30,000.
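As a hedged illustration of how such a tokenizer could be built and used with the sentencepiece library (corpus and file names are placeholders, not the paper's actual artifacts):

```python
import sentencepiece as spm

# Train a 30k-piece model on a plain-text corpus (hypothetical file names).
spm.SentencePieceTrainer.train(
    input="pretraining_corpus.txt",
    model_prefix="albert_sp",
    vocab_size=30000,
)

sp = spm.SentencePieceProcessor(model_file="albert_sp.model")
print(sp.encode("ALBERT shares parameters across layers.", out_type=str))
```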
Embeddings for multi-segment tasks
Standard BERT-style segment embeddings.
Positional Encoding
Learnable Absolute Positional Embeddings.
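A minimal sketch (assuming BERT-style position handling with a maximum length of 512): learned absolute positions are just another embedding table indexed by position id and added to the token (and segment) embeddings.

```python
import torch
import torch.nn as nn

max_len, hidden = 512, 768
position_embed = nn.Embedding(max_len, hidden)   # one learned vector per position

seq_len = 128
position_ids = torch.arange(seq_len)             # 0, 1, ..., seq_len - 1
pos_vectors = position_embed(position_ids)       # added to token/segment embeddings
print(pos_vectors.shape)                         # torch.Size([128, 768])
```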
Attention Mechanism
Multi-head scaled dot-product attention.
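A generic sketch of scaled dot-product attention over pre-split heads (not ALBERT-specific code; ALBERT additionally shares these weights across layers, as noted above):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq, head_dim). Illustrative sketch only."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)       # attention distribution per query
    return weights @ v

q = k = v = torch.randn(2, 12, 128, 64)   # batch=2, 12 heads, seq=128, head_dim=64
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                          # torch.Size([2, 12, 128, 64])
```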
Training Details
Pre-training Data
Wikipedia and BookCorpus; for the strongest reported results, also "the additional data used by XLNet and RoBERTa".
Pre-training Method
Masked Language Modelling: n-gram masking with a maximum n-gram length of three, which the authors attribute to Joshi et al. 2019 (SpanBERT).
- Choose 15% of tokens to mask as contiguous spans, with span lengths drawn from the n-gram masking distribution (see the sketch after this list).
- Of the chosen tokens, replace 80% with the special token [MASK], 10% with a random token from the vocabulary, and leave 10% unchanged.
- Train the model to predict the original tokens from the corrupted sequence via a cross-entropy loss.
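The paper gives the n-gram length distribution as $p(n) \propto 1/n$ with $n \le 3$; below is a minimal sketch of span selection under that distribution (helper names are illustrative, not from the paper's code).

```python
import random

MAX_N = 3
# p(n) proportional to 1/n for n = 1..3, as described in the ALBERT paper.
weights = [1.0 / n for n in range(1, MAX_N + 1)]

def choose_masked_positions(num_tokens, mask_rate=0.15):
    """Pick roughly 15% of positions as contiguous n-gram spans (illustrative sketch)."""
    budget = int(num_tokens * mask_rate)
    masked = set()
    while len(masked) < budget:
        n = random.choices(range(1, MAX_N + 1), weights=weights)[0]
        start = random.randrange(0, max(1, num_tokens - n + 1))
        masked.update(range(start, start + n))
    return sorted(masked)

print(choose_masked_positions(512)[:10])
```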
Sentence Order Prediction (SOP): NSP is dropped and replaced with sentence order prediction.
- Take two consecutive segments from the same document.
- Train the model to predict whether they appear in their original order or have been swapped (see the sketch below).
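A minimal sketch of how SOP training pairs could be built, assuming consecutive segments from the same document as the paper describes (the function name is illustrative):

```python
import random

def make_sop_example(segment_a, segment_b):
    """Given two consecutive text segments, return (first, second, label).

    label 1: segments kept in their original order (positive example)
    label 0: segments swapped (negative example), as in ALBERT's SOP objective
    """
    if random.random() < 0.5:
        return segment_a, segment_b, 1
    return segment_b, segment_a, 0

print(make_sop_example("The cat sat on the mat.", "Then it fell asleep."))
```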
Finetuning Data
SQuAD1.1, SQuAD2.0, MNLI, SST-2, RACE
Finetuning Method
Hyperparameters:
Dataset | Learning Rate | Batch Size | ALBERT Dropout Rate | Classifier Dropout Rate | Training Steps | Warmup Steps | Maximum Sequence Length |
---|---|---|---|---|---|---|---|
CoLA | $1 \times 10^{-5}$ | 16 | 0 | 0.1 | 5336 | 320 | 512 |
STS | $2 \times 10^{-5}$ | 16 | 0 | 0.1 | 3598 | 214 | 512 |
SST-2 | $1 \times 10^{-5}$ | 32 | 0 | 0.1 | 20935 | 1256 | 512 |
MNLI | $3 \times 10^{-5}$ | 128 | 0 | 0.1 | 10000 | 1000 | 512 |
QNLI | $1 \times 10^{-5}$ | 32 | 0 | 0.1 | 33112 | 1986 | 512 |
QQP | $5 \times 10^{-5}$ | 128 | 0.1 | 0.1 | 14000 | 1000 | 512 |
RTE | $3 \times 10^{-5}$ | 32 | 0.1 | 0.1 | 800 | 200 | 512 |
MRPC | $2 \times 10^{-5}$ | 32 | 0 | 0.1 | 800 | 200 | 512 |
WNLI | $2 \times 10^{-5}$ | 16 | 0.1 | 0.1 | 2000 | 250 | 512 |
SQuAD v1.1 | $5 \times 10^{-5}$ | 48 | 0 | 0.1 | 3649 | 365 | 384 |
SQuAD v2.0 | $3 \times 10^{-5}$ | 48 | 0 | 0.1 | 8144 | 814 | 512 |
RACE | $2 \times 10^{-5}$ | 32 | 0.1 | 0.1 | 12000 | 1000 | 512 |
Note that a wide range of hyperparameters has been tested and tuned here. Reportedly these hyperparameters were inspired by those of BERT, XLNet and RoBERTa, but the overlap is not clear. How much computation went into the hyperparameter search?
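As an illustration of how one row of the table translates into a training schedule (taking the MNLI row: learning rate $3 \times 10^{-5}$, 10,000 steps, 1,000 warmup steps), here is a hedged sketch with a linear warmup-then-decay schedule; the summary does not restate the exact fine-tuning optimizer, so AdamW is an assumption.

```python
import torch

model = torch.nn.Linear(768, 3)            # placeholder classifier head
total_steps, warmup_steps, peak_lr = 10_000, 1_000, 3e-5   # MNLI row above

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)  # AdamW is an assumption

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(5):                      # training-loop body omitted
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr())
```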
Evaluation
The median result over five runs is reported.
Evaluation Datasets
SQuAD1.1, SQuAD2.0, MNLI, SST-2, RACE
Evaluation Results
Model Name | Average | SQuAD1.1 | SQuAD2.0 | MNLI | SST-2 | RACE |
---|---|---|---|---|---|---|
ALBERT-base | 80.1 | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 |
ALBERT-large | 82.4 | 90.6/83.9 | 82.3/79.4 | 83.5 | 91.7 | 68.5 |
ALBERT-xlarge | 85.5 | 92.5/86.1 | 86.1/83.1 | 86.4 | 92.4 | 74.8 |
ALBERT-xxlarge | 91.0 | 94.8/89.3 | 90.2/87.4 | 90.8 | 96.9 | 86.5 |
Authors' Conclusions
- Although ALBERT has fewer parameters than BERT, it can easily be more computationally expensive.
- Sparse Attention (Child et al 2019) and Block Attention (Shen et al 2018) should be investigated.
- Try efficient language modelling (Yang et al 2019) and Hard Example Mining (Mikolov et al 2013).
- Sentence order prediction works well, but there may be many more self-supervised objectives worth exploring.