Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings.
- Additional Information
- Abstract:
Motivation: In recent years, pre-training with the transformer architecture has gained significant attention. While this approach has led to notable performance improvements across a variety of downstream tasks, the underlying mechanisms by which pre-training models influence these tasks, particularly in the context of biological data, are not yet fully elucidated.

Results: In this study, focusing on the pre-training on nucleotide sequences, we decompose a pre-training model of Bidirectional Encoder Representations from Transformers (BERT) into its embedding and encoding modules to analyze what a pre-trained model learns from nucleotide sequences. Through a comparative study of non-standard pre-training at both the data and model levels, we find that a typical BERT model learns to capture overlapping-consistent k-mer embeddings for its token representation within its embedding module. Interestingly, using the k-mer embeddings pre-trained on random data can yield similar performance in downstream tasks, when compared with those using the k-mer embeddings pre-trained on real biological sequences. We further compare the learned k-mer embeddings with other established k-mer representations in downstream tasks of sequence-based functional prediction. Our experimental results demonstrate that the dense representation of k-mers learned from pre-training can be used as a viable alternative to one-hot encoding for representing nucleotide sequences. Furthermore, integrating the pre-trained k-mer embeddings with simpler models can achieve competitive performance in two typical downstream tasks.

Availability and implementation: The source code and associated data can be accessed at https://github.com/yaozhong/bert_investigation. [ABSTRACT FROM AUTHOR]
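The two sequence representations the abstract contrasts, one-hot encoding and dense k-mer embeddings over overlapping k-mers, can be illustrated with a minimal sketch. This is not the authors' code: the function names (`kmer_tokens`, `one_hot`, `build_kmer_table`, `embed`) are hypothetical, and the embedding table is filled with random values as a stand-in for vectors that would in practice come from a pre-trained model.

```python
import itertools
import random

ALPHABET = "ACGT"


def kmer_tokens(seq, k):
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def one_hot(seq):
    """Encode each nucleotide as a 4-dimensional indicator vector."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq:
        row = [0] * len(ALPHABET)
        row[index[base]] = 1
        rows.append(row)
    return rows


def build_kmer_table(k, dim, seed=0):
    """Build a lookup table with one dense vector per possible k-mer.

    Assumption: the values are random placeholders; a real table would
    be taken from the embedding module of a pre-trained model.
    """
    rng = random.Random(seed)
    vocab = ["".join(p) for p in itertools.product(ALPHABET, repeat=k)]
    return {kmer: [rng.gauss(0.0, 1.0) for _ in range(dim)] for kmer in vocab}


def embed(seq, table, k):
    """Represent a sequence as the list of dense vectors of its k-mers."""
    return [table[token] for token in kmer_tokens(seq, k)]


if __name__ == "__main__":
    seq = "ACGTAC"
    table = build_kmer_table(k=3, dim=8)
    print(kmer_tokens(seq, 3))        # overlapping 3-mers of the sequence
    print(len(one_hot(seq)))          # one row per nucleotide
    print(len(embed(seq, table, 3)))  # one dense vector per k-mer
```

With stride 1, adjacent tokens share k-1 nucleotides, which is the overlap that the "overlapping-consistent" embeddings in the abstract refer to. Either representation can then be fed to a downstream classifier.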
- Copyright:
Copyright of Bioinformatics is the property of Oxford University Press / USA and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)