TTS (Text To Song)

April 06, 2024

PS:

Todd Rundgren: "I had a pretty good melodic idea, because I stole it: That sounds good--I just take that...which is essentially what musicians do. I couldn't figure out what the lyrics were supposed to be about because I wasn't much of a prose writer, and never got into poetry. There were many topics to write about then, the Vietnam war and the sexual revolution, the women's movement...I realized it was right in front of me the entire time...just start with the girl part, and get to the war part later." https://youtu.be/bFkpRel59eU?si=L_hirRdoUYlINQc-

***

Words can change everything in music; language is music.

Even if we don't fully understand or embrace AI in an artistic practice, it is nonetheless interesting to run A-B comparisons to see if the final results of either satisfy our objectives. It is perhaps too easy to let algorithms attempt to satisfy them. In the past when I've tried Boomy, the generative music became the proxy, as in this post.

When I begin with words, I am immediately looking for the music. When I begin with music, the music will suggest words, which we can either treat as placeholders to replace later or simply keep as-is.

Lately, as an experiment, I have taken my own diary entries (or combined them with the diary entries of others) for a particular day of the month across many decades and used them as a mini "data set" or "LLM". In other experiments I have rearranged older songs and derived the music from the rhythms in the synthesized voices, not from how I would sing them.

Using the TTS voices, the music in the language is decidedly different, but I can control it by slicing up the vocal sample so that words start on either an upbeat or a downbeat, which is part of what makes spoken language coherent. If we switched the emphasis so that downbeats became upbeats, speech would be difficult to understand. But in music you can do this. Think of Lennon's "Imagine": "Imagine all the peo-PLE...", which is not how the word is pronounced. "People" is a "downbeat" word.
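The slicing idea can be sketched in a few lines of Python. This is only an illustration, not the workflow actually used for these mixes: it assumes each sliced word already has an onset time in seconds, that "downbeat" means on the beat and "upbeat" means the off-beat eighth note, and the names `snap_onset` and `place_slices` are hypothetical.

```python
def snap_onset(onset_sec: float, bpm: float, target: str) -> float:
    """Move a word onset to the nearest downbeat or upbeat.

    Assumption: "down" = on the beat, "up" = halfway between two beats.
    """
    beat = 60.0 / bpm                       # length of one beat in seconds
    offset = 0.0 if target == "down" else beat / 2.0
    k = max(round((onset_sec - offset) / beat), 0)  # nearest grid index, never negative
    return k * beat + offset


def place_slices(slices, bpm):
    """slices: list of (word, onset_sec, target) tuples, target is "down" or "up"."""
    return [(word, snap_onset(t, bpm, target)) for word, t, target in slices]


# Hypothetical onsets for four sliced words at 120 bpm.
slices = [
    ("imagine", 0.05, "down"),
    ("all", 0.62, "up"),
    ("the", 0.90, "down"),
    ("people", 1.30, "down"),
]
for word, new_onset in place_slices(slices, bpm=120):
    print(f"{word:8s} -> {new_onset:.2f}s")
```

Changing a word's target from "down" to "up" is the code equivalent of shifting its emphasis, which is the "peo-PLE" effect described above.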

The music in language is usually driving the rest of the music. I say "usually" because music AI can't really do the musical mapping, at least not as elegantly as humans. It tends to jam all the words in without taking into account that singers have to breathe. Personally, I don't think lyrics should be too wordy, and singers will usually say the same. Certain words simply aren't singable by humans, and singability is, I think, what we ultimately want. But I like TTS because I can segment the audio files as an approximation of how the words could be sung. In fact, TTS can be the demo that I can't sing. And you can also swap out the words for other lyrics.
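As a rough illustration of leaving room to breathe, a lyric can be broken into phrases that stay under a syllable budget. Everything here is an assumption for the sake of the sketch: the vowel-group syllable count is a crude estimate, the budget of eight syllables is arbitrary, and the function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Crude estimate: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def breath_phrases(text: str, max_syllables: int = 8) -> list[str]:
    """Split text into phrases of at most max_syllables estimated syllables.

    A single word longer than the budget still gets its own phrase.
    """
    phrases, current, total = [], [], 0
    for word in text.split():
        n = count_syllables(word)
        if current and total + n > max_syllables:
            phrases.append(" ".join(current))   # close the phrase: breathe here
            current, total = [], 0
        current.append(word)
        total += n
    if current:
        phrases.append(" ".join(current))
    return phrases


lyric = "In the twilight's gentle gleam, where shadows dance and softly stream"
for phrase in breath_phrases(lyric):
    print(phrase)
```

A singer would place breaths by sense and line endings rather than by counting, so this is best read as a way to flag lines that are too wordy, not as a phrasing tool.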

"Obsidian Dreams" is a late-90s song of mine, with lyrics by a collaborator at the time.

This is how I sang it, and have notated it as such:

This is the TTS version notated:

This is the TTS voice set against a beat (Mix 1, Mix 2):

Lyrics generated by ChatGPT:

Query 1: Write a song titled "Obsidian Dreams"

(Verse 1)
In the twilight's gentle gleam,
Where shadows dance and softly stream,
There lies a world of unseen seams,
Where hearts are bound in obsidian dreams.

(Chorus)
Obsidian dreams, they call to me,
In the whispers of the night, I'm set free.
Lost in the depths of endless schemes,
In the embrace of obsidian dreams.

(Verse 2)
Beneath the stars, a silent choir,
Echoes of longing, burning fire,
In the labyrinth of desires, we roam,
In the realm where darkness finds its home.

(Chorus)
Obsidian dreams, they call to me,
In the whispers of the night, I'm set free.
Lost in the depths of endless schemes,
In the embrace of obsidian dreams.

(Bridge)
Through the maze of memories,
Where secrets hide, and fears unfurl,
We journey on, bound by destiny,
In the shadowed realm, we dare to twirl.

(Chorus)
Obsidian dreams, they call to me,
In the whispers of the night, I'm set free.
Lost in the depths of endless schemes,
In the embrace of obsidian dreams.

(Outro)
As the moonlight softly gleams,
And the world fades into streams,
We surrender to the silent screams,
In the enchantment of obsidian dreams.

Query 2: Write a song about the dreams and visions of Native Americans as a means for accessing knowledge and establishing a relationship to the world.

(Verse 1)
Beneath the stars, where spirits roam,
The elders gather, it's their home.
In whispered winds and ancient trees,
Their dreams reveal what's meant to be.

(Chorus)
Visions rise, across the land,
Guiding us with unseen hands.
In the heart of the fire's glow,
The wisdom of the past will show.

(Verse 2)
Through painted skies and sacred lands,
They walk in silence, hand in hand.
Their dreams a bridge to realms unseen,
Where truths are found, forever keen.

(Chorus)
Visions rise, across the land,
Guiding us with unseen hands.
In the heart of the fire's glow,
The wisdom of the past will show.

(Bridge)
Through the dance of flames, they see,
The tapestry of destiny.
In every shadow, every stream,
The echoes of a timeless dream.

(Verse 3)
With eyes closed tight, they journey far,
To mountains high and distant stars.
Their dreams the key to understanding,
The universe's gentle hand in hand.

(Chorus)
Visions rise, across the land,
Guiding us with unseen hands.
In the heart of the fire's glow,
The wisdom of the past will show.

(Outro)
In dreams they walk, both near and far,
Connecting us to who we are.
Through their visions, we're led to see,
The boundless truth of unity.

This version matches the original rhyme scheme almost perfectly, which suggests that better questions lead to better results.

It's the natural music in language that initially gives shape to the music. I have also done this to drive classical works, where a musical phrase becomes the seed idea.

Again, AI for me is a tool. But it also raises the question of whether it interferes with the spiritual nature of creativity. That should be an important consideration, unless the words are merely a form of wacky word play, and that's good too.

***

Syd Barrett's songs, notably "Bike", were described by Roger Waters as being "painterly", which I don't think you'd get from typical AI word churning. Human-generated work has more capacity for the charm of wabi-sabi.

"...he allowed the lyric and the rhythm to attach to one another in a way that was both poetic and musical...there's something about the way the lyric attaches to the meter in a very satisfying way, "but I borrowed it" kind of kicks it off [obliquely]...the unpredictability of it combined with its simplicity made it so special. It's almost like it's a conversation about a bike. There's nothing staged or crafted about it. It's very painterly."

Also, we tend to prefer 4/4 as a metric structure, as opposed to triple meters such as 3/4 or compound duple meters such as 6/8. We don't typically speak in 6/8 or 3/4, but lyrics can be manually mapped into those meters. Music AI doesn't have that capability (for now). To have music always in 4/4 is incredibly narrow.
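To show what manual mapping into 6/8 might look like, here is a minimal sketch. It assumes the stresses are marked by hand, that a 6/8 bar has six eighth-note slots with strong pulses on slots 0 and 3, and that stressed syllables should wait for the next strong pulse. The function name `map_to_meter` and the stress markings are hypothetical.

```python
def map_to_meter(syllables, slots_per_bar=6, strong_slots=(0, 3)):
    """Place syllables on a grid of equal slots.

    syllables: list of (text, stressed) pairs, stress marked by hand.
    Returns a list of (bar, slot, text). Stressed syllables are delayed
    until the next strong slot; unstressed ones take the next free slot.
    """
    placed = []
    pos = 0  # running slot index across all bars
    for text, stressed in syllables:
        if stressed:
            while pos % slots_per_bar not in strong_slots:
                pos += 1  # leave a rest (or a held note) before the stress
        placed.append((pos // slots_per_bar, pos % slots_per_bar, text))
        pos += 1
    return placed


# "Obsidian dreams, they call to me", with assumed stresses.
line = [
    ("Ob", False), ("sid", True), ("i", False), ("an", False),
    ("dreams", True), ("they", False), ("call", True),
    ("to", False), ("me", True),
]
for bar, slot, text in map_to_meter(line):
    print(f"bar {bar + 1}, eighth {slot + 1}: {text}")
```

Calling the same function with `slots_per_bar=3, strong_slots=(0,)` gives a quarter-note grid in 3/4. It does not handle pickups, melisma, or syllables longer than one slot, which is where the human ear comes in.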
