The paper finds that the Next Sentence Prediction (NSP) objective from [BERT] does not necessarily improve performance: NSP showed no consistent gain in the authors' ablation study.
Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. 2019 Jun 19.