Voice AI can conduct interactions in a variety of languages. We integrate with several STT/TTS providers to broaden language coverage; however, you should consider the quality and compatibility of each language, both for transcription (STT) and for the Voice AI voice (TTS). In addition, our S2S model, provided by OpenAI, can support multiple languages and switch between them on the fly. While this is also possible with S4TS, there are constraints depending on how the languages of the TTS, LLM and STT match up; more on this can be found in the S4TS section.
S2S
Speech to Speech, which is provided by OpenAI, does not list the languages it specifically supports, but rather indicates coverage for most commonly spoken languages. While this is broadly true, the quality of the bot's speech varies by language. Some users have reported that the Voice AI speaks non-English languages with an American accent.
Unlike with S4TS models, you do not select a particular language other than by prompting the bot. The bot can be prompted to speak only in a specific language or subset of languages, or you can tell it to respond in whichever language the customer first speaks. You can additionally prompt it to speak with a specific accent; however, the quality varies by region. For example, an Irish accent works well, but a Welsh accent does not.
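For example, a system prompt along these lines (the exact wording is illustrative, not a prescribed template) can constrain which languages the S2S bot will speak:

```
Respond only in the language the customer first speaks to you.
If the customer speaks English or German, continue in that language.
If they speak any other language, politely explain in English that
you can only assist in English or German.
```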
We recommend S2S where English is the primary language to be spoken, or where you have conducted extensive testing with the S2S model in the target language to ensure pronunciation and dialect are accurate.
S4TS
Speech to text and text to speech work differently from S2S in that up to three separate AI components handle the interaction: the speech-to-text (STT), which converts the customer's voice into text; the LLM, which processes that text and generates a response; and the text-to-speech (TTS), which converts the response into spoken word, acting as the bot's voice. This setup is referred to as S4TS. This section discusses language selection, which, unlike with S2S, can be defined explicitly.
STT
STT is the act of converting the customer’s voice into text for an LLM to process. For greater language support and flexibility we integrate with the following vendors:
- Deepgram
- Google Speech
When configuring these in the Voice AI bot you will be presented with a list of languages to choose from:

The associated model will be updated automatically to the best option for your language selection; some languages are only available with certain models. Deepgram is the recommended vendor. Language and model support for each vendor can be found here:
Google STT
General Documentation
Model & Language Support
Deepgram STT
General Documentation
Model & Language Support
Where possible, you should select a specific language for greater transcription accuracy. If this is not possible, multilingual can be selected to provide cross-language support.
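The selection logic above can be sketched as a small helper. The language-to-model mapping below is an example only, not Deepgram's actual support matrix; always check the vendor documentation linked above for the real list.

```python
# Illustrative sketch: prefer a specific language where supported,
# otherwise fall back to a multilingual model. The mapping is
# hypothetical and does not reflect any vendor's real support matrix.
LANGUAGE_MODELS = {
    "en": ["nova-2", "base"],  # English: widest model coverage
    "de": ["nova-2", "base"],
    "pt": ["nova-2"],
}

FALLBACK = ("multi", "nova-2")  # multilingual fallback


def pick_stt_config(language: str) -> tuple[str, str]:
    """Return a (language, model) pair for the STT configuration."""
    models = LANGUAGE_MODELS.get(language)
    if models:
        return language, models[0]
    return FALLBACK
```

A specific language wins when available because it gives the transcriber a narrower search space; the multilingual fallback trades some accuracy for coverage.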
LLM
The text generated by the above model is then passed to an LLM along with the rest of the transcription, any prompts, and any tools made available for the LLM to use. We currently use OpenAI or Azure OpenAI models for Voice AI; like S2S, these do not publish a specific list of supported languages, but rather work on a best-effort basis depending on the training data available in the target language.
The quality of the generated text depends on the language: English is heavily supported, while non-English languages vary in quality depending on how widely the language is spoken. The LLM may also use more informal language than you might expect.
We can overcome some of these issues by prompting the AI on how to respond, including specifying the level of formality and telling it to account for regional language variations and non-standard slang.
TTS
The final part of this process is the TTS, which takes the text generated by the LLM and converts it into audio to be played to the customer. Our current recommended providers are:
- ElevenLabs
- Google Speech
Google TTS
General Documentation
Voice & Language Support
ElevenLabs TTS
General Documentation
Model & Language Support
Text to Speech | ElevenLabs Documentation (these are the "models")
Voices
With TTS, the language support will depend on the voice which you wish to use:

Different voices support different language sets: some are single-language only, while others offer broader support. Choose a voice whose language support best covers the languages you chose in the STT section. The ElevenLabs voice page allows you to preview a voice in different languages to test its quality. Consider regional dialects too; for example, a Portuguese voice may be trained on Brazilian Portuguese rather than European Portuguese.
Cascading Considerations
To ensure the best compatibility between models, make sure the supported languages overlap consistently. For example, choosing an STT model that supports German, telling your LLM to generate responses in German, and then choosing a French-speaking voice will cause issues and poor performance in the voice bot. It is also important to consider the limitations of the LLM in generating text suitable for the TTS voice to speak. Tracing how data flows from one model to the next can help you understand and resolve poor-quality transcriptions and poor-quality audio output.
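The overlap rule can be expressed as a pre-flight check. This is a hypothetical sketch of our own, not a feature of the Talkative platform: it flags a configuration where the LLM's target language is missing from either the STT or the TTS side of the cascade.

```python
def check_language_cascade(stt_languages: set[str],
                           llm_language: str,
                           tts_voice_languages: set[str]) -> list[str]:
    """Flag mismatches between STT, LLM and TTS language settings.
    Returns a list of human-readable problems; empty means consistent."""
    problems = []
    # "multi" stands in for a multilingual STT model selection.
    if llm_language not in stt_languages and "multi" not in stt_languages:
        problems.append(
            f"STT cannot transcribe '{llm_language}': callers speaking it "
            "will produce poor transcripts for the LLM."
        )
    if llm_language not in tts_voice_languages:
        problems.append(
            f"TTS voice does not support '{llm_language}': generated "
            "responses will be mispronounced."
        )
    return problems
```

The German-STT, German-LLM, French-voice example above would fail the second check, surfacing the mismatch before a customer hears it.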
Non English Language Support
As English is the predominant language, it is the best supported by these platforms. Non-English languages may encounter quality issues. Talkative can offer guidance and support to improve quality, but as the limitations lie with the upstream providers, we may not be able to overcome them. This is an industry-wide issue which we expect to improve over time. Please also note that Talkative's implementation team is primarily English speaking as of late 2025.
Requesting Additional Language Support
In some cases, we may be able to offer a bespoke integration with a specialised STT/TTS provider to overcome language issues. However, it is important to consider that the LLM stage of the flow may still prevent proper pronunciation, even with a bespoke TTS provider. For Talkative to consider a bespoke integration, commercial terms would need to be discussed to cover the R&D associated with investigating feasibility as well as the implementation itself. We reserve the right to decline any integration at any point during development if we believe it would prove detrimental to the product.