How Hacks Happen

AI Part 3: Tools, Technology, and Magical Connections

Many Worlds Productions Season 3 Episode 18

Let's explore the technology behind audio and music AI, and how it makes use of some cool math concepts to bring us convincing and realistic voices and music. Hear some actual AI-generated tunes, and find out how "latent space" changed the way AI generates sounds and music.


AI Part 3: Tools, Technology, and Magical Connections

Welcome to How Hacks Happen. In this episode, we’ll be looking at some of the tools used for artificial intelligence, otherwise known as AI. The thing is, scammers use these tools, too. And if you know how the tools work, you can empower yourself with a greater understanding of how scammers use AI to try and fool you.

Because this is an audio episode, I’m going to focus on audio-based AI, like text-to-speech and music. 

First up is text-to-speech AI, which relies on voice cloning. One of the frontrunners is a tool called ElevenLabs, which I talked about in AI Part 2, where I gave a little demo of my own synthesized voice after I cloned it with ElevenLabs. And I talked about how scammers can use a synthesized voice to fake any accent and have it say anything they want.

But how does this tool really work? It’s kind of based on the way babies learn languages. 

Have you ever witnessed the process of a child learning to talk? Of course you have. What we think of as a normal part of human development is something of a miracle. A baby hears sounds, lots of sounds, and eventually figures out that some of them mean certain things. Of course there’s “Mama” and “Dada,” and they quickly learn that “milk” means that yummy white food stuff they put in their mouth, and “ball" means that round toy thing. And on top of that, they figure out how to manipulate the muscles within their mouths and their throats to make these same sounds, eventually learning not only the language, but the exact way of saying it with the accent that they hear every day, whatever that may be. 

When you consider that while your own children or grandchildren are learning English in this manner, there are millions of children in other countries learning their countries’ native languages and accents in this same way, it’s kind of mind-boggling. They’re absorbing the subtleties of their languages’ grammar and slang, making the exact sounds their parents make with the same facial muscles their parents use, to make sounds that are specific to their language.

There’s also an element of training. When babies are talked to often, with simple sentences and exaggerated enunciation, their speech and language development accelerates. Babies are all about imitating the sounds that they hear. Eventually they understand what each sound means, but when they start off, it’s all just imitation.

But while a child takes a few years to learn to imitate all the sounds of their language, AI does it in hours, or even minutes. Which is pretty amazing.

Let’s take a look at the process you can use to synthesize your own voice and make it say things that you type in as text: things you never actually said, but that the AI software can make sound as if you did.

Let’s walk through the steps at ElevenLabs, which is free to play around with.

If you want to play along, go to ElevenLabs.io and sign up for an account. I recommend you do this on a PC, not your phone; it’s going to be easier to use that way.

If you’re still on the home page after signup, click the Go to App button to open the app on your PC.

You’ll see a menu at the left side of the screen. I’d recommend starting off with the Text-to-Speech tool under “Playground”. In the middle of the screen, type a sentence or two, whatever you want it to say. Then click the “Generate Speech” button at the bottom middle of the screen. After a few minutes, your newly generated audio will start to play. 

Synthesized voice: I just typed some random words into the app, and used the default voice it gave me, with the default settings, and this is what I got. 

If you like what you hear, you can click the Download button at the bottom right of the screen, and save it as an MP3 file.

At the upper right, you can use the Voice dropdown menu to pick different voices. You can search for voices based on accent or language or a bunch of other criteria.

If you want to clone your own voice, you can do a quick version with the free plan. On the menu at the left, click the plus button next to Voices, and choose the Instant Voice Clone option on the menu that pops up. Here you can upload a few minutes of your voice to make an instant clone. Give it a name, and now you can use it with the text-to-speech option to make yourself speak.

You’ll also see settings at the right for Speed, Stability, Similarity, and Style Exaggeration. Cranking up the Stability can make the voice sound kind of monotone, while cranking up the Style Exaggeration will give you greater variations in the pitch and rhythm.
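If you’d rather script this than click through the web app, here’s a rough sketch of what the same text-to-speech request might look like through ElevenLabs’ web API. The endpoint, header, and field names here are based on my reading of their public docs and may have changed, so treat them as assumptions and check the current API reference before relying on them.

```python
# Rough sketch of a text-to-speech request to ElevenLabs' REST API.
# The endpoint and field names are assumptions based on their public docs;
# check the current API reference, since details may have changed.
import requests

API_KEY = "your-api-key-here"          # from your ElevenLabs account settings
VOICE_ID = "your-voice-id-here"        # e.g. the ID of your instant voice clone

url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
headers = {"xi-api-key": API_KEY, "Content-Type": "application/json"}
payload = {
    "text": "I just typed some random words into the app.",
    # These roughly mirror the sliders in the web app:
    "voice_settings": {
        "stability": 0.5,          # lower = more expressive, higher = more monotone
        "similarity_boost": 0.75,  # how closely to stick to the cloned voice
        "style": 0.3,              # style exaggeration: pitch and rhythm variation
    },
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()

# The response body is the generated audio, which you can save as an MP3.
with open("generated_speech.mp3", "wb") as f:
    f.write(response.content)
```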

These settings are key to making a voice sound natural. But while you’re playing around with these settings, scammers are doing the same thing. The one big advantage you have is that you know what a natural voice sounds like in your native language and accent. And scammers often don’t. They’ll make a voice that sounds fine to them, but if you listen closely, you can spot that it’s not really natural speech.

One of the big giveaways is the word “kindly”, as in “Kindly look at this invoice.” Who do you know who actually uses the word “kindly” when they’re asking for a favor? I certainly don’t know anyone who does. They say “please” as in, “hey, could you please take a look at this for me? I’d really appreciate it.” Something like that.

This is because the word “kindly” is used in polite English communication in India, Nigeria, and the Philippines. Nothing wrong with that, but if the person communicating with you is claiming to be from a native English-speaking country, the word “kindly” signals it’s a scam.

If you call out a scammer on their weird phrasing, they might respond with, “That’s because English isn’t my native language.” Well, then, why are you speaking with a native English accent? You might get them to admit they used voice synthesis, but then they’ll claim it’s because they’re embarrassed about their accent. Oooh, big red flag. Where did that accent come from? Oh, it’s because your mom is Danish and your dad is from Mexico? So, even though you grew up in Montana, you have this weird accent. Uh-huh.

News flash: Kids who grow up in the United States have United States accents. It doesn’t matter what accent their parents have; they get their accent from their teachers and their friends, and also from TV and the media. Same goes for Canada, the UK, Australia, New Zealand, and any country, really. I am living proof of this. My mother had a pronounced Quebecois accent, and although I can imitate it really well when I am trying to fool somebody into thinking I am Canadian, my everyday accent is from the country where I grew up, which is the United States.

Yeah, this is a lot. You hear a voice, in a recording or a video or a phone call, and you have to unpack all that in a few brief seconds. Does it sound real? A lot of the time, you just have to go with your instincts. If something sounds a little bit off, it might just be a synthesized voice.

It might help to know how voice synthesis works under the hood. The technology behind it is actually pretty fascinating. It’s the result of decades of work, figuring out what actually makes a voice, and how these identifying markers work in tandem with those settings like Similarity and Style Exaggeration.

Let’s break down the steps.

The first step is the voice cloning, or analysis. AI voice cloning starts by working with a sample of a person’s voice. In my case, I gave ElevenLabs around three hours of audio from my podcast, which is a pretty good sample size.

The cloning algorithm takes the audio waveform of the sample, and looks for human-relevant features, like waveforms associated with how the vocal tract shapes sound. Your vocal tract is the whole system that makes your voice sound the way it does, including your lungs, your vocal cords, your throat, the shape of your mouth, and how much of it goes through your nose. Like this, this is a nasal voice, like Fran Drescher in the TV show The Nanny. That’s very nasal.

The cloning algorithm converts all this information into numerical data.
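To make that a bit more concrete, here’s a toy example of the kind of feature extraction involved, using the open-source librosa library. This is not ElevenLabs’ actual pipeline, and the filename is just a placeholder; it’s only meant to show the general idea of turning a waveform into numbers describing pitch and the way the vocal tract shapes sound.

```python
# Toy illustration of turning a voice recording into numbers.
# This is not any vendor's actual cloning pipeline, just the general idea.
import librosa
import numpy as np

# Load a few seconds of speech (any WAV or MP3 file of your voice).
audio, sample_rate = librosa.load("my_voice_sample.wav", sr=22050)

# Estimate pitch (fundamental frequency) over time.
f0, voiced_flag, voiced_prob = librosa.pyin(
    audio, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6")
)

# MFCCs are a classic way to capture how the vocal tract shapes sound.
mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)

# Collapse everything into one feature vector describing this voice.
voice_features = np.concatenate([
    [np.nanmean(f0)],          # average pitch
    [np.nanstd(f0)],           # how much the pitch rises and falls
    mfccs.mean(axis=1),        # average vocal-tract "shape"
])
print(voice_features.shape)    # a handful of numbers standing in for a voice
```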

And while everyone’s vocal tract is unique, which is why our voices all sound different, the range falls within certain boundaries. Like, do you remember that scene in the movie Splash, when the mermaid, played by Daryl Hannah, tried to tell Tom Hanks her real name? Because the mermaid was more fish than human, her name sounded like this.

Man: What's your name?

Mermaid: You can't say my name in English.

Man: Say it in your language.

Mermaid: My name is Eeeeiiiiiii Eeeeiiiiiii!

That’s the sound of a bunch of TVs breaking at the end of that clip. It was pretty funny. Yeah, that name was most definitely not expressed within the range of a human vocal tract.

This idea, that AI’s analysis of a voice stays within the boundaries of the sounds humans make when they speak normally, will become important in the later step of voice synthesis.

But let’s get back to the analysis part, the cloning. After the waveform analysis, the AI software tracks the rise and fall of the speech, the intonation, to get the rhythm of that speaker.

And then it looks for how the phonemes are expressed. Phonemes are the smallest units of speech, like the “k” sound in “cat”, or the “b” in “bat”.
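If you want to see phonemes for yourself, the free CMU Pronouncing Dictionary breaks English words down into them. Here’s a tiny sketch using the NLTK library, assuming you have it installed:

```python
# Quick look at phonemes using the CMU Pronouncing Dictionary via NLTK.
# Assumes NLTK is installed; the cmudict corpus is downloaded on first run.
import nltk

nltk.download("cmudict", quiet=True)
pronunciations = nltk.corpus.cmudict.dict()

for word in ["cat", "bat"]:
    print(word, "->", pronunciations[word][0])

# cat -> ['K', 'AE1', 'T']
# bat -> ['B', 'AE1', 'T']
```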

Put this all together, the vocal tract characteristics, the rhythms, and the phonemes, and you have a pretty good mathematical representation of a human voice.

But that’s not all. Now we get to the synthesis part, the text-to-speech feature in ElevenLabs and other software. That’s when we use all this data to make a new audio presentation of that voice, saying sentences that the original voice never said, but sounding just like that original voice would say it.

When AI analyzes your voice, it’s not just your voice that it has stored away there. It has thousands of other voices, and it uses all these voices together. Allow me to explain.

That mathematical representation of your one voice is actually a long series of numbers, each one representing some different aspect of your voice. One number identifies pitch, whether it’s high or low. Another one might represent something called “Harmonic-to-Noise Ratio,” which is how smooth or raspy a voice is. Like, this is a smooth voice, and this is a raspy voice, which could be said to have a lot of “noise” in it. I don’t know how long I can keep up this raspy thing, it’s kind of tough for me. <clears throat> Whew, I’m glad that’s over.

Anyway, all these numbers, these data points about a particular voice, are put into this huge multi-dimensional graph called latent space. Think of it as kind of like 3D chess, but with multiple dimensions. It’s actually not something we humans can visualize, since we can see only three dimensions. But math can do it, this thing of making a multi-dimensional graph.

So your one voice is put in this latent space with all your numbers, but it’s there along with all the other voices the AI program has. So you can see how these different voices resemble each other by seeing which ones have points close to each other, in some dimension or another. Like your voice is similar to Sally and Rachel’s voices in pitch, and similar to Cameron and Ashley and Barbara’s voices in smoothness. But it’s not just a couple of people, it’s thousands.

This thing of having all the voices hanging out together in latent space is what makes AI voice synthesis sound so convincing. When I type in a sentence I’ve never said and tell the software to simulate me saying it, it doesn’t have a sample of me saying that exact thing. So it looks at other points that are nearby in the big multidimensional graph, sort of “consults” with them, and borrows a few characteristics. Based on my voice, plus a little “consultation” with these similar voices, it can put together really convincing audio from the text I just typed in.
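Here’s a bare-bones sketch of that “consultation” idea in code: each voice is a point made of a few numbers, and the software finds the points closest to mine and blends in their characteristics. Real systems use learned embeddings with hundreds of dimensions; the names and numbers below are made up purely for illustration.

```python
# Bare-bones illustration of "consulting" nearby voices in latent space.
# Real systems use learned embeddings with hundreds of dimensions;
# these vectors and names are made up for illustration.
import numpy as np

# Each voice is a point: [pitch, raspiness, speaking rate, nasality]
voices = {
    "me":      np.array([0.62, 0.20, 0.55, 0.30]),
    "Sally":   np.array([0.60, 0.45, 0.40, 0.25]),
    "Rachel":  np.array([0.65, 0.50, 0.70, 0.20]),
    "Cameron": np.array([0.30, 0.22, 0.52, 0.35]),
    "Ashley":  np.array([0.58, 0.18, 0.60, 0.50]),
}

def nearest_neighbors(target_name, k=2):
    """Find the k voices closest to the target in this little latent space."""
    target = voices[target_name]
    distances = {
        name: np.linalg.norm(vec - target)
        for name, vec in voices.items()
        if name != target_name
    }
    return sorted(distances, key=distances.get)[:k]

neighbors = nearest_neighbors("me")
print("Closest voices:", neighbors)

# "Borrow" characteristics by averaging my point with my neighbors' points.
blended = np.mean([voices["me"]] + [voices[n] for n in neighbors], axis=0)
print("Blended voice vector:", blended)
```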

But because all the points are human voices, this consultation never strays outside the bounds of human voices. The latent space is kind of like a giant cage that contains the realistic human voice. So no matter how hard you try, you can’t make a voice that sounds like Daryl Hannah in Splash saying her name.

This, my friends, is what AI is all about. Making these thousands of connections in ways we never had before, but keeping it within realistic bounds. And doing it in seconds.

As a comparison, remember how robotic computer-generated voices sounded just a few years ago? That was before latent space, when a voice generator had only its own voice to work with. Now, with thousands of voices to consult with in its big ol’ latent space, and the AI technology to do it, we have really good voice synthesis. It’s not perfect, but it’s pretty dog-garned good, especially if used in short doses, like a sentence or two here and there.

Whew, that’s a lot of time spent on just voice AI. But this concept of latent space, this multidimensional graph of attributes represented by numbers, and consulting other points in the graph when it needs to figure out what to do, it’s actually part of just about every AI model. You might even say it’s what makes AI…AI.

And we haven’t even talked about music yet! 

AI-generated music has been around for decades, but the tools required a knowledge of programming, and they were clunky to use. Then in 2023, AI for music hit the mainstream, when the tools became a lot easier for anyone to use. Like, insanely easy. In fact, the music you’re listening to right now, I generated with AI software called Suno. It took about fifteen minutes, from the time I opened the software to the time I was listening to the song in its final form. I typed in a text prompt, asking for an emotional piece with violins and cellos and lots of fast arpeggios, which is music-speak for the notes of a chord played quickly one after another, running up and down. And Suno delivered.

I like this piece of music. It’s nice. It’s a little repetitive, and borrows heavily from famous classical pieces, and it’s kind of boring, but it’s okay. It’s about on par with something I’d get from a stock music website. And just fine as background music on a podcast.

In 2023, we started to see some unusual uses of AI creep into popular music. One of the most famous cases happened in April 2023, when an anonymous artist made a song called “Heart on My Sleeve,” with vocal tracks that sounded exactly like Canadian musicians Drake and the Weeknd. The artist, who calls himself Ghostwriter977, actually wrote and produced the music, but he used AI voice filters to make the singers sound like Drake and the Weeknd. Ghostwriter977’s reason for doing this was that he’s a songwriter who wants more exposure for his music, so he had the idea of showing people how his song would sound if it had been recorded by famous singers.

“Heart on My Sleeve” was eventually taken down from Spotify and Apple Music and YouTube and TikTok, but not until it had racked up millions of listens. This particular use of AI didn’t seem to have any scammy purpose, like, Ghostwriter977 wasn’t intending to make money off of it, but it does point up how easy it is to fool people into thinking they’re listening to a real live human being, and not AI.

Clearly, some lines needed to be drawn here. 

In September 2023, the CEO of Spotify, Daniel Ek, clarified some of the rules Spotify planned to follow with regard to AI-generated music. Tools like auto-tune are allowed, but music that mimics artists to the point of confusing people is not. In other words, Spotify allows AI-generated music, as long as it isn’t pretending to be from a known artist.

The music industry is still trying to figure out how to deal with AI-generated music and singing, but at least that’s a start.

I decided to try it myself, to see how viable it would be to generate a track that sounded like someone famous. I spent a few minutes in Suno messing around, and I wrote some lyrics, and eventually created what I call “Walk the Talk,” one of the mysterious Lost Tracks of Madonna, an unreleased recording discovered in a trunk in producer Jellybean’s attic in Brooklyn last week. What a find!

Singing: 

Look where you walk, you can't just walk
Cuz they'll all talk and talk and talk about you
Where do you go, gonna miss the show
Where do you go when no one knows about you?

Nice backstory, huh? Jellybean was one of Madonna’s first producers, and his remixes of her early music are considered to have played a huge part in her meteoric rise to fame in the 1980s. So I could spin a nice little tale on Spotify or YouTube or TikTok, and Madonna fans would be all over that stuff. They'd snap it up like a $2 pair of black lace fingerless gloves.

Don’t worry, I’m not going to do that. For one thing, I don’t want to get sued by Madonna. Love ya, girl, wouldn’t do ya dirty like that!

I just wanted to show you how fast and easy it would be to fake a song from a well-known artist. Would you believe the whole process of creating this song took under half an hour?

Will these fake songs be a thing in the future? I have no idea. But if you like listening to my perky little 80s track, does it really make any difference if it was sung by a real person? I’ll let you decide.

So how does this AI music generation thing work? It’s actually a lot like the way voice synthesis works, just applied to entire pieces of music instead of a single voice.

AI-generated music uses latent space in the same way as voice synthesis, but with numbers for things like tempo, the instruments used, and the types of rhythms. So when you ask for a certain type of music, it can consult all these different values and put together something within the boundaries of what’s reasonable for a piece of music. 
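Same idea as before, just with musical attributes. Here’s a made-up sketch of a musical “latent space” where generating something new means sliding between existing style points. Real music models learn these dimensions from training data rather than labeling them by hand; the styles and numbers below are invented for illustration.

```python
# Made-up sketch of a musical "latent space": each style is a point, and
# generating "something in between" is just moving between points.
# Real music models learn these dimensions; they aren't hand-labeled like this.
import numpy as np

# Dimensions: [tempo (normalized), strings, percussion, rhythmic complexity]
styles = {
    "cinematic_cello": np.array([0.35, 0.90, 0.30, 0.40]),
    "dance_pop_80s":   np.array([0.75, 0.20, 0.85, 0.60]),
}

def interpolate(style_a, style_b, amount):
    """Slide from style_a toward style_b; amount is 0.0 to 1.0."""
    return (1 - amount) * styles[style_a] + amount * styles[style_b]

# A point one-quarter of the way from dramatic strings toward 80s pop.
new_point = interpolate("cinematic_cello", "dance_pop_80s", 0.25)
print("New style point:", new_point)
```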

But for this reason, this staying within the boundaries, the music AI generates tends to be kind of unoriginal. For example, you’re listening to a piece I generated with a prompt for dramatic, cinematic music, heavy on the cello. And what it came up with has some serious Game of Thrones vibes, without actually being as brilliant and original as the Game of Thrones theme.

AI-generated music is kind of like, if you have a friend who knows how to play every instrument, and knows every great song ever written. So if you ask them for a song with certain instruments in a certain style, they can grab bits and pieces from their vast library of knowledge, put them together in a decently musical way, and play it for you. The result might be pretty, but will always sound kind of recycled.

So if you’re going for that beautiful sad song to make you feel all the feelings and cry those tears, or a song that will lift you into that higher plane where everything in the world is magical, or elevate that pivotal scene in your film to the next level…Nah, AI isn’t gonna do that for you. Not without some human tweaking, maybe a lot of human tweaking.

But one thing it is good for is inspiration. If you create some AI music, maybe you’ll get inspired to write something original, based on some of the little snippets that AI served up for you.

That’s all for today. What do you think of the AI-generated music, should I make some more? Well, I will anyway, if only because it will help me get better at spotting other people’s AI-generated music! I hope you try out these tools and have some fun with them. A little experience just might make you better at spotting the fakes when they come your way.

See you next time on How Hacks Happen.