Embodiments of the present disclosure relate to the generation of a media file, and more particularly to a system and a method for generating an audio-visual media file based on input of at least one of one or more texts, one or more audios, and one or more images. The technical field of the present invention pertains to computer science, wherein the process for which the patent application is filed generates an audio-visual media file from text.