This paper addresses the task of generating natural language descriptions of music, known as music captioning, using deep learning and large language models. A novel encoder architecture is trained to learn large-scale music representations and produce high-quality embeddings, which a pre-trained language-model decoder then conditions on to generate captions. Training captions are drawn from the state-of-the-art LP-MusicCaps dataset. A qualitative and subjective assessment of the generated captions is performed, highlighting the differences between various decoder models.
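As a rough illustration of the encoder-decoder pipeline described above, the sketch below shows one common way to condition a pre-trained decoder on audio embeddings: the encoder's output is prepended as a prefix to the decoder's token embeddings. This is a minimal sketch, assuming a PyTorch audio encoder and a Hugging Face GPT-2 decoder; all module names, layer counts, and hyperparameters here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class AudioEncoder(nn.Module):
    """Illustrative encoder: maps a mel-spectrogram to a sequence of embeddings.
    (Hypothetical stand-in for the paper's novel encoder architecture.)"""

    def __init__(self, n_mels: int = 128, d_model: int = 768):
        super().__init__()
        # Downsample along time and project into the decoder's embedding space.
        self.conv = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, time) -> (batch, time', d_model)
        x = self.conv(mel).transpose(1, 2)
        return self.transformer(x)


class CaptionModel(nn.Module):
    """Prefix-style conditioning: audio embeddings are prepended to the
    decoder's token embeddings, and the decoder continues the sequence
    as a caption."""

    def __init__(self, encoder: nn.Module, decoder: GPT2LMHeadModel):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder  # pre-trained language-model decoder

    def forward(self, mel: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        prefix = self.encoder(mel)                            # (B, T_audio, d)
        tokens = self.decoder.get_input_embeddings()(input_ids)  # (B, T_text, d)
        inputs = torch.cat([prefix, tokens], dim=1)
        return self.decoder(inputs_embeds=inputs).logits


# Usage sketch: a 10-frame dummy spectrogram and a short token sequence.
decoder = GPT2LMHeadModel.from_pretrained("gpt2")  # hidden size 768 matches d_model
model = CaptionModel(AudioEncoder(), decoder)
logits = model(torch.randn(1, 128, 10), torch.tensor([[50256]]))
```

In this kind of setup, swapping the decoder (e.g. different GPT-2 sizes or other causal language models) changes only the `decoder` argument, which is one way such a study could compare caption quality across various decoder models.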