Odd Taxi: How Does Anime Dialogue Sound Natural?

Why does Odd Taxi's dialogue sound "natural"? This question seemed deceptively simple to me when I first considered it. I assumed it was a similar situation to Your Name, where many of the voice actors, including our main leads, are amateurish and have little experience in creating that 'anime voice'. However, that is not the case for Odd Taxi at all. With big names like Hanae Natsuki and Kimura Ryouhei in the cast, a lack of professional experience cannot explain why Odd Taxi's dialogue sounds natural.

Perhaps then, we can consider the intention to act out a certain rawness. While Iida Riho can voice the highly stylized Rin in Love Live, she can also tap into her own daily experiences to bring out a natural-sounding voice for Shirakawa. The tone, the pronunciation, the way someone breathes, the way someone tells a joke: these are all elements we can consider when trying to decipher how a professional voice actor can sound "natural". And as we dive deeper into this discussion, we can talk about other elements that add to the sense of realism as well, such as the layering of reactions or the attention paid to space.

Disclaimer: This essay will only focus on episodes 1 and 2 of Odd Taxi. It will also borrow examples from other anime for the sake of comparison; however, none of those examples will carry any major spoilers (I will still post a warning regarding the scenes I reference in my videos). Finally, the analysis of sound itself is an extremely impressionistic topic to touch upon. After all, how a conversation sounds to an introvert may be wildly different from how an extrovert views a "normal" conversation. Hence, rather than stating facts, I want this essay to act as a catalyst for you to reflect upon what realistic dialogue means to you. I would love to hear any of your personal feelings and I will gladly hear out any contrary opinions.

The Cadence of the Voice

Let us try a small exercise.

Above us are two common sitting postures. On the left, you are sitting upright, while on the right, you are slouching forward, slightly resting on your stomach. First off, try the upright posture and say this line: "Shirakawa is best girl". Try to project at a comfortable volume as well, not too soft or too loud, as if you are speaking to someone sitting in front of you. Now, try the slacker position. Relax your shoulders and slump forward like a pile of jelly. Your body can lean in whatever direction or at whatever angle you want, so long as you feel comfortable and lazy. Now, repeat the line: "Shirakawa is best girl".

Do you notice a difference in the cadence of your voice? For me personally, when I'm sitting upright, its cadence sounds normal. However, when I slouch forward, my voice ends up sounding more nasal instead. This is because when you are sitting upright, there isn't a lot of pressure weighing down on your diaphragm, and thus when you try to project your voice, the air can travel comfortably from there out of your mouth. But when you are slouching forward, there is more pressure on that region, and hence what you instinctively do is move your voice up to your throat or your nose, creating that nasal voice. This isn't to say that all your syllables become nasal when you speak in a relaxed posture. Instead, it is usually the plosives ('t', 'k', 'p', etc.) and the stressed syllables of a word, the sounds that would otherwise need more force, that feel more restricted; hence the nasal sound. With that said, let us watch a short video and look at how our Twitter clout chaser uses his voice.

You will come to realize that he uses a lot of that nasal voice we discussed, almost as if his entire body is slouching lazily against his seat. In some of his softer sentences, you won't notice too much of a difference. However, when he becomes emotional, such as when he is saying "It's not pointless!", you will realize that his voice travels more from his throat and nose in an attempt to project himself. He also has a speech pattern where he sometimes drags out a few of his syllables, such as 'tanjuniiiii', and most often drags out the syllables at the end of his sentences, like 'naaaaa' or 'laaaaa'. Hmm, does this type of speech pattern sound familiar to you? For me, I can imagine myself speaking like this when I'm trying to explain things in a lazy manner: "Just go ahead and do thissss, and then do thaaaaat, then that will happennn."

The nasal voice and the dragging of syllables are often used when we are relaxed or lazy. After all, using our diaphragm to project takes a lot more effort than just speaking with our throat or nose. These are the same vocal techniques that professional singers refer to when they say that singing with your diaphragm gives you more power than singing with your throat. Similarly, these are the same vocal techniques that are beaten into professional voice actors. Anime characters are often exaggerated beings; when they get emotional, they have a tendency to belt out their voices from the bottom of their bellies. Hence, to keep that consistency, VAs are told to always project with their diaphragm, even when normal circumstances don't call for it.

This may not necessarily be a conscious directorial decision either. If you have ever seen a recording studio, you will realize that every voice line is recorded with the voice actors standing upright, giving us that clear and powerful voice. Even when their characters are slouching or lying down, the VAs won't necessarily change their own posture, choosing to remain standing upright. This isn't to say that these are poor voice acting decisions. After all, in a recording environment, having the VAs prance around the room may prove to be the sound editors' nightmare. Nonetheless, this creates the slightest sense of artificiality, one that we have grown used to hearing, and yet one that lingers at the back of our minds.

Alongside this strict adherence to consistency, we can consider clarity as well. Let's watch a great example of that in the very same episode. (Video below)

Notice how the newscaster projects her voice clearly, pronouncing every single syllable with proper emphasis. In contrast, you will realize how our Twitter poster often slurs his syllables together, even combining two of them into one. In-universe, this difference in enunciation makes sense given who our characters are and who they are speaking to. However, from a production standpoint, we can make some conjectures as well. In almost every anime, being able to convey your words to the audience is pivotal, and failing to do so may make your production seem amateurish. In many voice acting schools, then, VAs are often seen doing tongue-twisters, just to train themselves to enunciate with crystal clarity for their recordings. The result is a set of peculiarities that pervade almost all anime, such as the way that Ayanokoji from Classroom of the Elite speaks. (Video below shows S1EP1 of the show).

In this comparison, both characters have very similar speaking patterns, mainly in their monotonous and aphoristic way of speaking. However, the difference lies in how Odokawa mumbles his sentences while Ayanokoji enunciates each of them. In a sense, then, Ayanokoji speaks as if he is a natural orator, more so than the old man who is twice his age. If you have watched the rest of CotE or even any modern anime, you will realize that almost all of their characters speak with that same clarity and projection. Regardless of whether they are high schoolers, NEETs or beggars on the streets, they all carry their voices as if they are natural orators. In the cases where they don't, it is more often to play up a trope than to make characters sound like actual people.

This isn't to say that speaking in a nasal voice and mumbling your words is the human standard of speaking. After all, you can probably point out a lot of your friends and family who effortlessly carry their voices with great clarity. Perhaps the reason why this feeling of realism still shines through is that anime rarely shows the other side of the coin. We are so used to seeing twelve-year-olds sounding like public speakers that we forget that kids are often mumblers in real life. It is for this same reason that hiring amateurish VAs provides that same touch of realism as well. Having yet to master their voices, they show us a rare glimpse of raw dialogue that we never see in anime: a conversation where people slur their words, downplay their emotions and speak "unattractively".

The 'Space' in a Conversation

Why did Odokawa and Shirakawa's minute-long silence in the taxi feel so natural?

The answer should be intuitive by now. Anime often doesn't show the awkward pauses of human conversations, and hence when it does, they instantly become memorable and highly relatable. But there is also much more to the rawness of conversations than just the occasional awkward silence. When considering the "space" of a dialogue, we can also think about the lack thereof (Video below shows EP1 of Engage Kiss).

This comparison has both shows doing a typical 'Manzai' routine, a Japanese variation of the age-old straight man and wise guy duo. First off, let's isolate some factors. We can intuitively feel that Odd Taxi's version of the routine sounds more natural because of the more monotonous tone that Odokawa and Shirakawa carry themselves with. However, the opinion that down-to-earth emotions sound more realistic than anime hijinks is probably a trite point to make, so let's set that factor aside. Instead, let us consider solely the "space" of the dialogue, or to put it in less fanciful terms, the time between action and reaction.

In a 'Manzai' routine, the wise guy would usually kick off the action by saying something wacky, to which the straight man would usually respond with logic and reason. Engage Kiss gives a perfect example of that, with Kisara's obsessive behavior being 'tsukkomi-ed' by Shuu's nervousness. However, take note of Shuu's reactions. In between many of Kisara's ramblings, you will often see him showing no reaction, only choosing to cut in when she ends her sentences. This raises the question: What is stopping him from reacting to Kisara at the beginning or in the middle of her sentences? When she turned away from him and spoke in resignation, it wasn't as if he couldn't understand what she was getting at. In fact, one might have already been reacting the moment she said "there's nothing I could've done..." Similarly, when Kisara's meltdown began, could we not have expected Shuu to make more of an effort to deny her, rather than just standing still? While the scene was nonetheless comedic in its music and visuals, the back-and-forth may seem scripted and choreographed, simply from the fact that action and reaction were split like panels of a 4-koma.

We can compare this to Odd Taxi, which interlaces action and reaction. When Shirakawa opens with "Hitsuji...", which means sheep in Japanese, Odokawa has already begun reacting to her with his own straight-laced comeback. Similarly, Shirakawa does not need to finish saying "MVP" for him to immediately understand the context and be surprised by it. Purely in terms of action and reaction, then, Odd Taxi flows more naturally because its characters respond to contextual clues. It is for this reason that the later group conversation at the diner has flashes of naturalism as well, simply from the fact that characters speak over one another (Video below).

Thus far, in our discussion of space, we have looked at examples of how a lot of it, or none at all, can help disturb the static rhythm of scripted dialogue. Now let us consider other ways of manipulating space. The next two points I will be making have little relevance to Odd Taxi, but I will bring them up solely for the sake of completing the analysis (and because I love both shows).

Let us consider silence once again. Think back to a time when you were talking with someone. That person might have said something thought-provoking to you and you needed to take a few seconds to think about your response. Now, a question: How did you fill that small void while the other person awaited your response? Perhaps the answer may be found in this next clip, from the first episode of SSSS Dynazenon.

Vocalised pauses, like the 'erms' and the 'ahs', pervade many of our conversations. 'Erms' can be us gathering our train of thought; 'ehs' and 'ahs' can be a response of surprise; 'ohs' can be a gasp of realization; even the 'I sees' can be a non-committal response, neither agreeing nor disagreeing, but simply acknowledging. While these are often seen as verbalized hesitation and a lack of confidence in speaking, they are no doubt almost universal amongst humans. After all, we human beings often have a shaky train of thought. We don't always have the wittiest responses nor the most appropriate ones. As our defence mechanism, then, we have weaponized our 'erms' and 'ahs' to make that void ever less awkward.

This isn't to say that many anime forgo these vocalised pauses. For example, 'ohs' and 'ahs' are common in many of them as simple acknowledgment noises. However, 'erms' do seem to be rarer for many anime characters. This is because most of our beloved heroes have an instantaneous, almost reflexive train of thought. In an attempt to portray them as confident figures, or simply to infodump the audience with acute conciseness, shows rarely let them carry any semblance of hesitation in their speech, even in moments of great emotional trauma. Many anime, then, can be said to have forgotten the pauses of natural speech.

Vocalised pauses don't need to take the form of the usual suspects either. In SSSS Dynazenon's case, we have Yume dragging out her "kyou (today)" or Yomogi dragging out his "demo (but)" as their slight forms of hesitation. Vocal inflection can also tell us a lot about their current feelings. In Yume's case, the most obvious would be when she says "dayone (that's true)" with a slight chuckle, almost as if she is exhaling the syllables. This creates a sense that she is laughing at her own awkwardness, embarrassed by it. Similarly, Yomogi's "n-nande desuka (but why)" with its stutter adds to the overall flavour of the scene as well. Once again, this isn't to say that having vocalised pauses and stuttering is the human standard of speaking. However, people often have to actively train themselves to stop these audible hesitations in order to project confidence. In our daily lives, we more often encounter people who still frequently fire off their 'erms' and 'ahs' or other weird noises to fill the space. As these noises become rarer in anime characters who sound like natural orators, it feels refreshing when a show like SSSS Dynazenon rolls around and is recommended specifically for its awkward, yet realistic, dialogue.

And lastly, what of space? Now I mean it literally. What about the actual space, the room or area that the conversation is taking place in? Have you ever wondered why old Ghibli films sound strangely realistic? Have you ever thought about why Sonny Boy sounds so different from every other anime out there?

The answer is simple: sound mixing. That's it. The key to creating a natural-sounding anime is to level the sounds of the voices and the environment realistically. Let us once again learn from comparison, this time between scenes from the first episodes of Sonny Boy and Shikimori's Not Just a Cutie (Video below).

The major difference I can point out is in the volume at which the characters speak. For Shikimori, you will notice that the characters are very loud; every inflection they make peaks in our ears and even the little 'hmms' by Shikimori resound in our skulls. You can imagine the exact scenario these voice lines were recorded in: a sound-proof chamber, spoken less than a millimeter away from the mic. In contrast, Sonny Boy does the opposite. Its characters range between soft and loud, often with an echo added as if to mimic an actual open-air conversation. And that's the difference: the relationship between the characters and the environment.

In Shikimori, you will notice how the volume of the characters greatly overwhelms the background noises of the bowling alley. However, if you have ever been to one, you will know that the environment is loud as hell; the sound of the pins being knocked down floods the entire facility and you have to actively project your voice even when you are speaking to someone next to you. Now, this isn't to say that the sound design of Shikimori is bad in this regard. In fact, it is the opposite: rather than creating a realistic yet cacophonous environment, the better directorial decision is to focus on the comfy jokes our characters are sharing. However, this nonetheless sets a tone that is almost intuitive to our ears and our minds. Shikimori would rather have the environment cater to its characters than have its characters dwell within the environment.

The latter option is what Sonny Boy chooses. From Nagara's POV, you will notice that Nozomi's voice lines are distanced and filtered with an echo. From Nozomi's POV, some of Nagara's lines may be nearly inaudible. And all of this is set against the incessant cries of the cicadas and the birds that seek to drown out their voices. For Sonny Boy, it is the conscious decision to make the environment and the characters compete for discernibility that ties it to our real-life sensibilities. Similarly, then, the tone is set: Sonny Boy is an anime that doesn't want to sound like it was recorded within sound-proof walls, but rather like it exists out in the real world.

On the latter points of vocalised pauses and realistic sound mixing, you will realize that even Odd Taxi doesn't adhere to these raw elements. Many of its characters speak with great coherence, barely vocalising any of their hesitations. Odokawa's voice lines sound much the same whether he is in the taxi, the diner or the park, giving little consideration to the environment. However, this isn't to say that Odd Taxi sounds "off". For an anime that centers its story progression on its complex dialogue, it makes sense that clarity of speech takes precedence over realism. Rather, the sound directors made the decision to tap into other forms of natural speech, mainly the unique cadences of a voice.

Conclusion

Every sound in anime is a conscious and stylistic decision. That is the main takeaway of this essay. Whenever characters speak in a monotone or nasal voice, there are certain impressions that the show is guiding you to have of those characters. Whenever you hear the OST drop away and an influx of environmental noises take its place, there is a certain spatial immersion that the show is trying to establish.

Perhaps as we watch a show, then, I want to prompt every one of y'all to truly think about the elements of sound you are hearing. If you believe Odd Taxi sounds realistic to your ears, think about how that impression was created. Was it because of the amateurish voice acting, or because the monotony of its conversations is relatable to your own? What does Odd Taxi have that other shows like Shikimori or Sonny Boy have, or don't have?

For me, I think it is only when we dissect all these elements of sound that we can truly appreciate the amount of effort and creative juices that go into recording some of our favourite anime.
