Personally I don't know of any specific research, but AFAIK the reason our voices sound terrible to us when recorded is that they're missing the bass that's transmitted through the skull, i.e. they sound higher pitched on a recording. So I'd try increasing the amplitude of the lower frequencies on the EQ; it'll take a bit of fiddling to get the right balance.
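If you want to experiment with that, the usual tool is a low-shelf boost. Here's a minimal sketch using the RBJ Audio EQ Cookbook low-shelf formulas; the 300 Hz corner and +6 dB gain are just illustrative starting points I picked, not researched values:

```python
import numpy as np
from scipy.signal import lfilter

def low_shelf(fs, f0, gain_db):
    """RBJ Audio EQ Cookbook low-shelf biquad; returns (b, a) coefficients."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / 2 * np.sqrt(2)   # shelf slope S = 1
    cw = np.cos(w0)
    sa = 2 * np.sqrt(A) * alpha
    b = [A * ((A + 1) - (A - 1) * cw + sa),
         2 * A * ((A - 1) - (A + 1) * cw),
         A * ((A + 1) - (A - 1) * cw - sa)]
    a = [(A + 1) + (A - 1) * cw + sa,
         -2 * ((A - 1) + (A + 1) * cw),
         (A + 1) + (A - 1) * cw - sa]
    return b, a

# Boost everything below ~300 Hz by 6 dB to fake the bone-conducted bass.
fs = 48000
b, a = low_shelf(fs, 300, 6.0)
# With your recording loaded as a float array `voice`:
#   boosted = lfilter(b, a, voice)
```

From there you'd A/B against your internal sense of your voice and nudge the corner frequency and gain until it feels right.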
In my experience, my internal voice has updated over the years to match how I sound when recorded. So when I hear myself speaking on a recording it's much less jarring now, since it's much closer to how I predicted it would sound.
I know this doesn't help your current situation, but it's a fun thing I noticed once you mentioned it.
Yup same, after a bunch of podcasting I don't hear any meaningful difference between my voice in my head and in a recording.
Which means that modifying a recording to match what you hear really can't be boiled down to a series of steps, because it's highly subjective and probably not consistent.
As a singer who has recorded myself not only singing, but also talking for thousands of collective hours of recording time, this is my experience as well. I can't even hear the difference anymore, and I haven't been able to since probably hour 10.
It's possible, and there has been plenty of research into it, but it's a hard problem and not just a question of slapping an off-the-shelf EQ on it.
I think you could get a close approximation. First, close-mic yourself when recording, then in Adobe Audition apply the DeReverb filter. Roll off the highs, probably above 6 to 8 kHz or so, by 12-18 dB, and boost the fundamental frequency and first overtone of the speaking voice by six decibels or so (a rough guess would be something like 120 Hz and 240 Hz for men, and 200 Hz and 400 Hz for women; use the spectral display to find them for sure). Then put a compressor on the whole thing to limit the difference between the loudest and quietest parts to somewhere around 70 dB or so. Finally, put a very light reverb with a lot of damping on it. We're talking almost so little reverb that you can't even tell there is reverb applied unless you A/B the sound.
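A rough sketch of most of that chain in Python, skipping the Audition DeReverb step (no simple equivalent), using the RBJ Audio EQ Cookbook biquad formulas plus a toy envelope-follower compressor. Every frequency and dB figure below is just the ballpark from the paragraph above, not a tuned preset:

```python
import numpy as np
from scipy.signal import lfilter

def peaking(fs, f0, gain_db, q=1.0):
    """RBJ peaking biquad: boost/cut a band around f0 by gain_db."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = [1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A]
    a = [1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A]
    return b, a

def high_shelf(fs, f0, gain_db):
    """RBJ high shelf; use a negative gain_db to roll off the top end."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / 2 * np.sqrt(2)   # shelf slope S = 1
    cw = np.cos(w0)
    sa = 2 * np.sqrt(A) * alpha
    b = [A * ((A + 1) + (A - 1) * cw + sa),
         -2 * A * ((A - 1) + (A + 1) * cw),
         A * ((A + 1) + (A - 1) * cw - sa)]
    a = [(A + 1) - (A - 1) * cw + sa,
         2 * ((A - 1) - (A + 1) * cw),
         (A + 1) - (A - 1) * cw - sa]
    return b, a

def compress(x, fs, threshold_db=-24.0, ratio=4.0, attack_ms=5.0, release_ms=100.0):
    """Very simple feed-forward compressor with an envelope follower."""
    atk = np.exp(-1 / (fs * attack_ms / 1000))
    rel = np.exp(-1 / (fs * release_ms / 1000))
    env = np.zeros_like(x)
    e = 0.0
    for i, s in enumerate(np.abs(x)):
        coef = atk if s > e else rel      # fast attack, slow release
        e = coef * e + (1 - coef) * s
        env[i] = e
    level_db = 20 * np.log10(np.maximum(env, 1e-9))
    over = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over * (1 - 1 / ratio)     # squash everything above threshold
    return x * 10 ** (gain_db / 20)

def process(voice, fs, f0_hz=120):        # ~120 Hz men, ~200 Hz women per the guess above
    y = voice
    for f in (f0_hz, 2 * f0_hz):          # fundamental + 1st overtone, +6 dB each
        b, a = peaking(fs, f, 6.0)
        y = lfilter(b, a, y)
    b, a = high_shelf(fs, 7000, -15.0)    # roll off the highs by ~15 dB
    y = lfilter(b, a, y)
    return compress(y, fs)                # tame the dynamic range
```

The light damped reverb at the end is left out here; any convolution or algorithmic reverb plugin at a very low mix would stand in for it.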
Things like that should account for the difference in bone conduction, the difference in how a microphone picks up your voice and the rest of the room, etc. But it will probably need a lot of tweaking for your specific application to sound exactly right.
A very simple approximation of your voice as it's heard when you're facing someone would be to use a unidirectional mic and record yourself with the mic pointing the opposite direction from how it normally would be (in other words, facing away from you).
A slightly better approximation would be to do the same thing but with two unidirectional mics, still facing away from you, pointed at slight angles to simulate the placement of your ears.
Obviously the quality of the mic factors in as well: you'd want mics with a flat frequency response. To get even pickier, you'd also want to listen back on headphones or speakers with a flat frequency response. Once you had the recording, you could even take impulse responses of certain rooms and process the audio to get an idea of how you sound to others in specific rooms!
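That last trick is just convolution of the dry recording with the room's impulse response. A minimal sketch, assuming you already have both as NumPy arrays (the function name and the wet/dry mix value are mine, purely illustrative):

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_room(dry, ir, wet=0.3):
    """Convolve a dry recording with a room impulse response and blend."""
    wet_sig = fftconvolve(dry, ir)[: len(dry)]
    # Scale the wet path back to the dry signal's peak so it doesn't clip.
    wet_sig /= max(np.max(np.abs(wet_sig)), 1e-9)
    wet_sig *= np.max(np.abs(dry))
    return (1 - wet) * dry + wet * wet_sig
```

With a real IR (many are freely downloadable, or you can record a clap or sine sweep in the room yourself) this gives a decent preview of "you, in that room."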
So to clarify, what I'm really curious about is speech comprehension when a text-to-speech AI is trained to sound like yourself (myself, in this case).
Over the last couple of weeks I have been working on learning about and using offline AI. I don't know why, but I decided to ask here first.
If anyone is curious, this is what my current favorite LLM has to say (Llama 2 70B Q4 GGML):
You
what does it take to tune a recording of my voice and match it to how I hear my own voice?
Assistant
To tune your recorded voice so that it matches the way you perceive your own voice, there are several steps involved in audio processing. Here's an overview of the process:
Recording: Firstly, ensure that you have a high-quality recording of your voice using proper equipment such as a microphone and soundproofing to minimize background noise. This will provide a good starting point for further editing.
Equalization (EQ): Use an equalizer to adjust the frequency balance in your recorded audio. You can boost or cut specific frequencies that are too prominent or lacking, respectively. For example, if you feel like your voice sounds nasally, you may want to reduce some of the mid-range frequencies around 1kHz and above.
Compression: Apply compression to even out the dynamic range in your recording. This will help balance the loudness between different parts of your speech or singing, making it sound more consistent.
Reverb/Delay: Add a small amount of reverb or delay to create a sense of space and depth around your voice. This can make it feel like you're hearing yourself in a room rather than through headphones or speakers.
Stereo Imaging: Use stereo imaging tools to adjust the width and positioning of your voice within the mix. This will help create an illusion that your voice is coming from a specific location, making it sound more natural.
Monitoring: Finally, listen back to your processed audio on different devices such as headphones or speakers in various environments (e.g., quiet room vs noisy cafe) to ensure that the changes you've made translate well across different listening scenarios. Make further adjustments if necessary until you are satisfied with how it sounds.
Remember, everyone perceives their own voice differently due to bone conduction and other factors, so there may be some trial-and-error involved in finding a sound that feels right for you. It's also important to keep in mind that the goal is not necessarily to make your recorded voice identical to how it sounds when speaking but rather to create an enjoyable listening experience for others who will hear it through speakers or headphones.
You
Is there any formulaic approximation or rule regarding bone conduction and how humans perceive their own voice?
Assistant
Bone conduction is the process by which sound vibrations are transmitted directly to the inner ear via the skull, bypassing the outer and middle ears. This occurs when we speak or sing, causing us to hear our voices differently than others do because of this direct transmission pathway.
There isn't a specific formulaic approximation or rule that can accurately predict how each individual perceives their own voice through bone conduction since it varies from person to person due to factors such as head shape and size, vocal tract resonance, and the acoustics of the surrounding environment. However, in general, people tend to hear their voices as deeper or richer when speaking because low-frequency sounds are more efficiently transmitted through bone conduction than high frequencies.
It's important to note that while understanding how we perceive our own voice can be helpful for tuning a recording of it, the goal should ultimately be to create an enjoyable listening experience for others who will hear your recorded voice through speakers or headphones rather than trying to perfectly replicate what you hear when speaking.
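The reverb/delay step from that list can be as simple as a single feedback delay line blended lightly with the dry signal. A toy sketch; the function name and all parameter values are placeholders I chose, not anything the model suggested:

```python
import numpy as np

def feedback_delay(x, fs, delay_ms=30.0, feedback=0.35, mix=0.15):
    """One feedback delay line, mixed lightly back in for a sense of space."""
    d = int(fs * delay_ms / 1000)       # delay length in samples
    y = x.astype(float).copy()
    for i in range(d, len(y)):          # each echo feeds back into the next
        y[i] += feedback * y[i - d]
    return (1 - mix) * x + mix * y      # mostly dry, a little wet
```

A real reverb adds many such delay lines plus damping filters, but for the "barely audible" amount described above, even this can be enough.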
My guess is there are some subjective components to it that simple changes wouldn't be able to account for; each person will have to make the adjustments themselves to match what they're hearing.
If you've never done this before, a really great way to hear what your voice sounds like to other people is to stand about a foot away from a 90° corner in your house and talk normally.
Your actual voice will bounce off of the walls, making it much easier for you to hear what you sound like to others.
Professional singers use this trick early on in their training so that they can practice their tone for their singing.