MycoAI: Fast and accurate taxonomic classification for fungal ITS sequences.

  • Additional Information
    • Source:
      Publisher: Blackwell; Country of Publication: England; NLM ID: 101465604; Publication Model: Print-Electronic; Cited Medium: Internet; ISSN: 1755-0998 (Electronic); Linking ISSN: 1755098X; NLM ISO Abbreviation: Mol Ecol Resour; Subsets: MEDLINE
    • Publication Information:
      Original Publication: Oxford, England : Blackwell
    • Subject Terms:
    • Abstract:
      Efficient and accurate classification of DNA barcode data is crucial for large-scale fungal biodiversity studies. However, existing methods are either computationally expensive or lack accuracy. Previous research has demonstrated the potential of deep learning in this domain, successfully training neural networks for biological sequence classification. We introduce the MycoAI Python package, featuring various deep learning models such as BERT and CNN tailored for fungal Internal Transcribed Spacer (ITS) sequences. We explore different neural architecture designs and encoding methods to identify optimal models. By employing a multi-head output architecture and multi-level hierarchical label smoothing, MycoAI effectively generalizes across the taxonomic hierarchy. Using over 5 million labelled sequences from the UNITE database, we develop two models: MycoAI-BERT and MycoAI-CNN. While we emphasize the necessity of verifying classification results by AI models due to insufficient reference data, MycoAI still exhibits substantial potential. When benchmarked against existing classifiers such as DNABarcoder and RDP on two independent test sets with labels present in the training dataset, MycoAI models demonstrate high accuracy at the genus and higher taxonomic levels, with MycoAI-CNN being the fastest and most accurate. In terms of efficiency, MycoAI models can classify over 300,000 sequences within 5 min. We publicly release the MycoAI models, enabling mycologists to classify their ITS barcode data efficiently. Additionally, MycoAI serves as a platform for developing further deep learning-based classification methods. The source code for MycoAI is available under the MIT Licence at https://github.com/MycoAI/MycoAI.
      (© 2024 The Author(s). Molecular Ecology Resources published by John Wiley & Sons Ltd.)
    • References:
      Abarenkov, K., Zirk, A., Piirmann, T., Pöhönen, R., Ivanov, F., Nilsson, R. H., & Kõljalg, U. (2023). Full UNITE+INSD dataset for Fungi. UNITE Community. https://doi.org/10.15156/BIO/2938065.
      Ahn, S.‐Y., & Lee, S.‐W. (2023, August). BERT‐based classification of fungi protein sequences with multiple GO labels. In Proceedings of the International Conference on research in adaptive and convergent systems. ACM. https://doi.org/10.1145/3599957.3606249.
      Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990, October). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410. https://doi.org/10.1016/s0022‐2836(05)80360‐2.
      Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv. https://doi.org/10.48550/ARXIV.1409.0473.
      Baldrian, P., Větrovský, T., Lepinay, C., & Kohout, P. (2021, February). High‐throughput sequencing view on the magnitude of global fungal diversity. Fungal Diversity, 114(1), 539–547. https://doi.org/10.1007/s13225‐021‐00472‐y.
      Bosco, G. L., & Gangi, M. A. D. (2017). Deep learning architectures for DNA sequence classification. In: A. Petrosino, V. Loia, & W. Pedrycz (Eds.), Fuzzy logic and soft computing applications (pp. 162–171). Springer International Publishing. https://doi.org/10.1007/978‐3‐319‐52962‐2_14.
      Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert‐Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few‐shot learners. arXiv. https://doi.org/10.48550/ARXIV.2005.14165.
      Busia, A., Dahl, G. E., Fannjiang, C., Alexander, D. H., Dorfman, E., Poplin, R., McLean, C. Y., Chang, P. C., & DePristo, M. (2018, June). A deep learning approach to pattern recognition for short DNA sequences. bioRxiv. https://doi.org/10.1101/353474.
      Cole, J. R., Wang, Q., Fish, J. A., Chai, B., McGarrell, D. M., Sun, Y., Brown, C. T., Porras‐Alfaro, A., Kuske, C. R., & Tiedje, J. M. (2013, November). Ribosomal database project: Data and tools for high throughput rRNA analysis. Nucleic Acids Research, 42(D1), D633–D642. https://doi.org/10.1093/nar/gkt1244.
      Dalla‐Torre, H., Gonzalez, L., Mendoza‐Revilla, J., Carranza, N. L., Grzywaczewski, A. H., Oteri, F., Dallago, C., Trop, E., de Almeida, B. P., Sirelkhatim, H., Richard, G., Skwark, M., Beguir, K., Lopez, M., & Pierrot, T. (2023, January). The nucleotide transformer: Building and evaluating robust foundation models for human genomics. bioRxiv. https://doi.org/10.1101/2023.01.11.523679.
      Devlin, J., Chang, M.‐W., Lee, K., & Toutanova, K. (2018). BERT: Pre‐training of deep bidirectional transformers for language understanding. arXiv. https://doi.org/10.48550/ARXIV.1810.04805.
      Fiannaca, A., La Paglia, L., La Rosa, M., Lo Bosco, G., Renda, G., Rizzo, R., Gaglio, S., & Urso, A. (2018, July). Deep learning models for bacteria taxonomic classification of metagenomic data. BMC Bioinformatics, 19(S7), 198. https://doi.org/10.1186/s12859‐018‐2182‐6.
      Ficetola, G. F., Miaud, C., Pompanon, F., & Taberlet, P. (2008). Species detection using environmental DNA from water samples. Biology Letters, 4(4), 423–425. https://doi.org/10.1098/rsbl.2008.0118.
      Flück, B., Mathon, L., Manel, S., Valentini, A., Dejean, T., Albouy, C., Mouillot, D., Thuiller, W., Murienne, J., Brosse, S., & Pellissier, L. (2021, May). Fast processing of environmental DNA metabarcoding sequence data using convolutional neural networks. bioRxiv. https://doi.org/10.1101/2021.05.22.445213.
      Hawksworth, D. L., & Lücking, R. (2017, September). Fungal diversity revisited: 2.2 to 3.8 million species. In J. Heitman, B. J. Howlett, P. W. Crous, E. H. Stukenbrock, T. Y. James & N. A. R. Gow (Eds.), The fungal kingdom (pp. 79–95). ASM Press. https://doi.org/10.1128/9781555819583.ch4.
      He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv. https://doi.org/10.48550/ARXIV.1512.03385.
      Hebert, P. D. N., Cywinska, A., Ball, S. L., & deWaard, J. R. (2003). Biological identifications through DNA barcodes. Proceedings of the Royal Society of London. Series B: Biological Sciences, 270(1512), 313–321. https://doi.org/10.1098/rspb.2002.2218.
      Helaly, M. A., Rady, S., & Aref, M. M. (2019, October). Convolutional neural networks for biological sequence taxonomic classification: A comparative study. In: A. Hassanien, K. Shaalan, M. Tolba (Eds.), Proceedings of the international conference on advances in intelligent systems and computing (pp. 523–533). Springer International Publishing. https://doi.org/10.1007/978‐3‐030‐31129‐2_48.
      Ji, Y., Zhou, Z., Liu, H., & Davuluri, R. V. (2021, February). DNABERT: Pre‐trained bidirectional encoder representations from transformers model for DNA‐language in genome. Bioinformatics, 37(15), 2112–2120. https://doi.org/10.1093/bioinformatics/btab083.
      Karollus, A., Hingerl, J., Gankin, D., Grosshauser, M., Klemon, K., & Gagneur, J. (2023, January). Species‐aware DNA language models capture regulatory elements and their evolution. Genome Biology, 25(1), 83. https://doi.org/10.1101/2023.01.26.525670.
      Kõljalg, U., Larsson, K.‐H., Abarenkov, K., Nilsson, R. H., Alexander, I. J., Eberhardt, U., Erland, S., Høiland, K., Kjøller, R., Larsson, E., Pennanen, T., Sen, R., Taylor, A. F. S., Tedersoo, L., Vrålstad, T., & Ursing, B. M. (2005, March). UNITE: A database providing web‐based methods for the molecular identification of ectomycorrhizal fungi. New Phytologist, 166(3), 1063–1068. https://doi.org/10.1111/j.1469‐8137.2005.01376.x.
      Kõljalg, U., Nilsson, H. R., Schigel, D., Tedersoo, L., Larsson, K.‐H., May, T. W., Taylor, A. F. S., Jeppesen, T. S., Frøslev, T. G., Lindahl, B. D., Põldmaa, K., Saar, I., Suija, A., Savchenko, A., Yatsiuk, I., Adojaan, K., Ivanov, F., Piirmann, T., Pöhönen, R., … Abarenkov, K. (2020, November). The taxon hypothesis paradigm—On the unambiguous detection and communication of taxa. Microorganisms, 8(12), 1910. https://doi.org/10.3390/microorganisms8121910.
      Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv. https://doi.org/10.48550/ARXIV.1808.06226.
      Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient‐based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. https://doi.org/10.1109/5.726791.
      Leslie, C., Eskin, E., & Noble, W. S. (2002). The spectrum kernel: A string kernel for SVM protein classification. In Pacific symposium on biocomputing (Vol. 7). Association for Computational Linguistics.
      Liang, Q., Bible, P. W., Liu, Y., Zou, B., & Wei, L. (2020, February). DeepMicrobes: Taxonomic classification for metagenomics with deep learning. NAR Genomics and Bioinformatics, 2(1). https://doi.org/10.1093/nargab/lqaa009.
      Lin, T.‐Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal loss for dense object detection. arXiv. https://doi.org/10.48550/ARXIV.1708.02002.
      Lücking, R., Aime, M. C., Robbertse, B., Miller, A. N., Aoki, T., Ariyawansa, H. A., Cardinali, G., Crous, P. W., Druzhinina, I. S., Geiser, D. M., Hawksworth, D. L., Hyde, K. D., Irinyi, L., Jeewon, R., Johnston, P. R., Kirk, P. M., Malosso, E., May, T. W., & Schoch, C. L. (2021, April). Fungal taxonomy and sequence‐based nomenclature. Nature Microbiology, 6(5), 540–548. https://doi.org/10.1038/s41564‐021‐00888‐x.
      Mock, F., Kretschmer, F., Kriese, A., Böcker, S., & Marz, M. (2022, August). Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks. Proceedings of the National Academy of Sciences, 119(35), e2122636119. https://doi.org/10.1073/pnas.2122636119.
      Nguyen, N. G., Tran, V. A., Ngo, D. L., Phan, D., Lumbanraja, F. R., Faisal, M. R., Abapihi, B., Kubo, M., & Satou, K. (2016). DNA sequence classification by convolutional neural network. Journal of Biomedical Science and Engineering, 9(5), 280–286. https://doi.org/10.4236/jbise.2016.95021.
      Nilsson, R. H., Ryberg, M., Kristiansson, E., Abarenkov, K., Larsson, K.‐H., & Kõljalg, U. (2006, December). Taxonomic reliability of DNA sequences in public sequence databases: A fungal perspective. PLoS ONE, 1(1), 1–4. https://doi.org/10.1371/journal.pone.0000059.
      Nilsson, R. H., Wurzbacher, C., Bahram, M. R. M., Coimbra, V., Larsson, E., Tedersoo, L., Eriksson, J., Ritter, C. D., Svantesson, S., Sánchez‐García, M., Ryberg, M., & Abarenkov, K. (2016, March). Top 50 most wanted fungi. MycoKeys, 12, 29–40. https://doi.org/10.3897/mycokeys.12.7553.
      Niskanen, T., Lücking, R., Dahlberg, A., Gaya, E., Suz, L. M., Mikryukov, V., Liimatainen, K., Druzhinina, I., Westrip, J. R. S., Mueller, G. M., Martins‐Cunha, K., Kirk, P., Tedersoo, L., & Antonelli, A. (2023, September). Pushing the frontiers of biodiversity research: Unveiling the global diversity, distribution, and conservation of fungi. Annual Review of Environment and Resources, 48, 149–176. https://doi.org/10.1146/annurev‐environ‐112621‐090937.
      Pham, T. V., Nguyen, V. V., Vu, D., Henneman, A. A., Richardson, R. A., Piersma, S. R., & Jimenez, C. R. (2023, March). A transformer architecture for retention time prediction in liquid chromatography mass spectrometry‐based proteomics. Proteomics, 23, e2200041. https://doi.org/10.1002/pmic.202200041.
      Rizzo, R., Fiannaca, A., Rosa, M. L., & Urso, A. (2016). A deep learning approach to DNA sequence classification. In: C. Angelini, P. Rancoita, S. Rovetta (Eds.), Computational intelligence methods for bioinformatics and biostatistics (pp. 129–140). Springer International Publishing. https://doi.org/10.1007/978‐3‐319‐44332‐4_10.
      Ruppert, K. M., Kline, R. J., & Rahman, M. S. (2019, January). Past, present, and future perspectives of environmental DNA (eDNA) metabarcoding: A systematic review in methods, monitoring, and applications of global eDNA. Global Ecology and Conservation, 17, e00547. https://doi.org/10.1016/j.gecco.2019.e00547.
      Sadad, T., Aurangzeb, R. A., Safran, M., Imran Alfarhood, S., & Kim, J. (2023, April). Classification of highly divergent viruses from DNA/RNA sequence using transformer‐based models. Biomedicines, 11(5), 1323. https://doi.org/10.3390/biomedicines11051323.
      Schoch, C. L., Seifert, K. A., Huhndorf, S., Robert, V., Spouge, J. L., Levesque, C. A., Chen, W., Fungal Barcoding Consortium, Fungal Barcoding Consortium Author List, Bolchacova, E., Voigt, K., Crous, P. W., Miller, A. N., Wingfield, M. J., Aime, M. C., An, K. D., Bai, F. Y., Barreto, R. W., Begerow, D., … Schindel, D. (2012, March). Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proceedings of the National Academy of Sciences, 109(16), 6241–6246. https://doi.org/10.1073/pnas.1117018109.
      Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers). Association for Computational Linguistics. https://doi.org/10.18653/v1/p16‐1162.
      Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567 http://arxiv.org/abs/1512.00567.
      UNITE Community. (2017). UNITE top50 release. UNITE Community. https://doi.org/10.15156/BIO/587477.
      van der Maaten, L., & Hinton, G. (2008). Visualizing high‐dimensional data using t‐SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.
      Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv. https://doi.org/10.48550/ARXIV.1706.03762.
      Vu, D., Groenewald, M., de Vries, M., Gehrmann, T., Stielow, B., Eberhardt, U., Al‐Hatmi, A., Groenewald, J. Z., Cardinali, G., Houbraken, J., Boekhout, T., Crous, P. W., Robert, V., & Verkley, G. (2019, March). Large‐scale generation and analysis of filamentous fungal DNA barcodes boosts coverage for kingdom fungi and reveals thresholds for fungal species and higher taxon delimitation. Studies in Mycology, 92(1), 135–154. https://doi.org/10.1016/j.simyco.2018.05.001.
      Vu, D., Groenewald, M., Szoke, S., Cardinali, G., Eberhardt, U., Stielow, B., De Vries, M., Verkleij, G. J. M., Crous, P. W., Boekhout, T. J. S. I. M., & Robert, V. (2016, September). DNA barcoding analysis of more than 9 000 yeast isolates contributes to quantitative thresholds for yeast species and genera delimitation. Studies in Mycology, 85(1), 91–105. https://doi.org/10.1016/j.simyco.2016.11.007.
      Vu, D., Groenewald, M., & Verkley, G. (2020, July). Convolutional neural networks improve fungal classification. Scientific Reports, 10(1), 12628. https://doi.org/10.1038/s41598‐020‐69245‐y.
      Vu, D., Nilsson, R. H., & Verkley, G. J. M. (2022, June). Dnabarcoder: An open‐source software package for analysing and predicting DNA sequence similarity cutoffs for fungal sequence identification. Molecular Ecology Resources, 22(7), 2793–2809. https://doi.org/10.1111/1755‐0998.13651.
      Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007, August). Naive bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267. https://doi.org/10.1128/aem.00062‐07.
      Xu, B., Zhang, Z., Sha, C., Du, M., Song, H., & Wang, H. (2022). A three‐stage curriculum learning framework with hierarchical label smoothing for fine‐grained entity typing. In: A. Bhattacharya, J. L. M. Li, D. Agrawal, P. K. Reddy, M. Mohania, A. Mondal, V. Goyal, & R. U. Kiran (Eds.), Database systems for advanced applications lecture notes in computer science (pp. 289–296). Springer International Publishing. https://doi.org/10.1007/978‐3‐031‐00129‐1_23.
      Zhou, Z., Ji, Y., Li, W., Dutta, P., Davuluri, R., & Liu, H. (2023). DNABERT‐2: Efficient foundation model and benchmark for multi‐species genome. arXiv. https://doi.org/10.48550/ARXIV.2306.15006.
      Zhou, Z., Wu, W., Ho, H., Wang, J., Shi, L., Davuluri, R. V., Wang, Z., & Liu, H. (2024). DNABERT‐S: Learning species‐aware DNA embedding with genome foundation models. arXiv. https://doi.org/10.48550/ARXIV.2402.08777.
    • Contributed Indexing:
      Keywords: deep learning; fungi; metabarcoding; mycology; neural networks; transformers
    • Accession Number:
      0 (DNA, Ribosomal Spacer)
      0 (DNA, Fungal)
    • Publication Date:
      Date Created: 20240817; Date Completed: 20241002; Latest Revision: 20241002
    • Publication Date:
      20241003
    • DOI:
      10.1111/1755-0998.14006
    • PMID:
      39152642
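
The abstract refers to a multi-head output architecture combined with multi-level hierarchical label smoothing. The following is a minimal Python sketch of one plausible reading of that idea, not the MycoAI package's actual API or implementation: smoothing mass is given only to classes that share the true class's parent taxon, and the per-level losses of the output heads are summed. The function names (`hierarchical_smooth`, `cross_entropy`, `multi_head_loss`), the `eps` value, and the toy hierarchy are all hypothetical.

```python
import math


def hierarchical_smooth(target, parent_of, eps=0.1):
    """Smoothed target distribution for one taxonomic level (illustrative).

    target    : index of the true class at this level (e.g. a genus)
    parent_of : parent_of[c] is the parent-level index of class c (e.g. its family)
    eps       : probability mass moved from the true class to its siblings
    """
    n = len(parent_of)
    siblings = [c for c in range(n)
                if c != target and parent_of[c] == parent_of[target]]
    dist = [0.0] * n
    if not siblings:  # no siblings under the same parent: keep a one-hot target
        dist[target] = 1.0
        return dist
    dist[target] = 1.0 - eps
    for c in siblings:
        dist[c] = eps / len(siblings)
    return dist


def cross_entropy(pred, target_dist):
    """Cross-entropy between predicted probabilities and a soft target."""
    return -sum(t * math.log(p) for p, t in zip(pred, target_dist) if t > 0)


def multi_head_loss(preds_by_level, targets_by_level):
    """Sum the per-level losses, one term per output head."""
    return sum(cross_entropy(p, t)
               for p, t in zip(preds_by_level, targets_by_level))


# Toy hierarchy (hypothetical): genera 0-2 belong to family 0, genus 3 to family 1.
genus_parent = [0, 0, 0, 1]
genus_target = hierarchical_smooth(0, genus_parent, eps=0.2)
family_target = [1.0, 0.0]  # family head left unsmoothed in this sketch

genus_pred = [0.7, 0.1, 0.1, 0.1]
family_pred = [0.9, 0.1]
loss = multi_head_loss([genus_pred, family_pred],
                       [genus_target, family_target])
```

In this sketch, a genus prediction that lands on a sibling genus within the correct family is penalized less than one that lands in a different family, which is one way a classifier could be encouraged to generalize across the taxonomic hierarchy.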