Klatt Voice Synthesis

Custom text-to-speech via Klatt formant synthesis with 3D spatial audio, phoneme mapping, and voice profiling.

The Klatt Voice Synthesis system provides custom text-to-speech for GS_Play projects using Klatt formant synthesis with full 3D spatial audio. It uses SoLoud internally for speech generation and MiniAudio for spatial positioning.

The system has two layers:

  • KlattVoiceSystemComponent – A singleton that manages the shared SoLoud engine instance and tracks the 3D audio listener position.
  • KlattVoiceComponent – A per-entity component that generates speech, queues segments, applies voice profiles, and emits spatialized audio from the entity’s position.

Voice characteristics are defined through KlattVoiceProfile assets containing frequency, speed, waveform, formant, and phoneme mapping configuration. Phoneme maps convert input text to ARPABET phonemes for the Klatt synthesizer, with support for custom pronunciation overrides.

For usage guides and setup examples, see The Basics: GS_Audio.

Klatt Voice Profile asset in the O3DE Asset Editor

 

Components

KlattVoiceSystemComponent

Singleton component that manages the shared SoLoud engine and 3D listener tracking.

| Field | Value |
|---|---|
| TypeId | {F4A5D6E7-8B9C-4D5E-A1F2-3B4C5D6E7F8A} |
| Extends | AZ::Component, AZ::TickBus::Handler |
| Bus | KlattVoiceSystemRequestBus (Single/Single) |

KlattVoiceComponent

Per-entity voice component with spatial audio, phoneme mapping, and segment queue.

| Field | Value |
|---|---|
| TypeId | {4A8B9C7D-6E5F-4D3C-2B1A-0F9E8D7C6B5A} |
| Extends | AZ::Component, AZ::TickBus::Handler |
| Request Bus | KlattVoiceRequestBus (Single/ById, entity-addressed) |
| Notification Bus | KlattVoiceNotificationBus (Multiple/Multiple) |

API Reference

Request Bus: KlattVoiceSystemRequestBus

System-level voice management. Singleton bus – Single address, single handler.

| Method | Parameters | Returns | Description |
|---|---|---|---|
| GetSoLoudEngine | (none) | SoLoud::Soloud* | Returns a pointer to the shared SoLoud engine instance. |
| SetListenerPosition | const AZ::Vector3& position | void | Updates the 3D audio listener position for spatial voice playback. |
| SetListenerOrientation | const AZ::Vector3& forward, const AZ::Vector3& up | void | Updates the 3D audio listener orientation. |
| GetListenerPosition | (none) | AZ::Vector3 | Returns the current listener position. |
| IsEngineReady | (none) | bool | Returns whether the SoLoud engine has been initialized and is ready. |

Request Bus: KlattVoiceRequestBus

Per-entity voice synthesis controls. Entity-addressed bus – Single handler per entity ID.

| Method | Parameters | Returns | Description |
|---|---|---|---|
| Speak | const AZStd::string& text | void | Converts text to speech and plays it. Uses the component's configured voice profile. |
| SpeakWithParams | const AZStd::string& text, const KlattVoiceParams& params | void | Converts text to speech using the specified voice parameters instead of the profile defaults. |
| StopSpeaking | (none) | void | Immediately stops any speech in progress and clears the segment queue. |
| IsSpeaking | (none) | bool | Returns whether this entity's voice is currently producing speech. |
| QueueSegment | const AZStd::string& text | void | Adds a speech segment to the queue. Queued segments play in order after the current segment finishes. |
| ClearQueue | (none) | void | Clears all queued speech segments without stopping current playback. |
| SetVoiceProfile | const AZ::Data::Asset<KlattVoiceProfile>& profile | void | Changes the voice profile used by this component. |
| GetVoiceProfile | (none) | AZ::Data::Asset<KlattVoiceProfile> | Returns the currently assigned voice profile asset. |
| SetSpatialConfig | const KlattSpatialConfig& config | void | Updates the 3D spatial audio configuration for this voice. |
| GetSpatialConfig | (none) | KlattSpatialConfig | Returns the current spatial audio configuration. |
| SetVolume | float volume | void | Sets the output volume for this voice (0.0 to 1.0). |
| GetVolume | (none) | float | Returns the current output volume. |

Notification Bus: KlattVoiceNotificationBus

Events broadcast by voice components. Multiple handler bus – any number of components can subscribe.

| Event | Parameters | Description |
|---|---|---|
| OnSpeechStarted | const AZ::EntityId& entityId | Fired when an entity begins speaking. |
| OnSpeechFinished | const AZ::EntityId& entityId | Fired when an entity finishes speaking (including all queued segments). |
| OnSegmentStarted | const AZ::EntityId& entityId, int segmentIndex | Fired when a new speech segment begins playing. |
| OnSegmentFinished | const AZ::EntityId& entityId, int segmentIndex | Fired when a speech segment finishes playing. |
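
The queueing and notification order described by the two buses can be sketched with a small standalone model. This is not the GS_Audio implementation: `VoiceQueueModel` and `FinishCurrentSegment` are hypothetical names, events are recorded as plain strings instead of EBus calls, and the sketch assumes that queueing a segment while idle starts playback immediately. Whether `StopSpeaking` fires a notification is not stated in the reference, so the model records nothing there.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for one voice component's segment queue.
class VoiceQueueModel
{
public:
    // Mirrors QueueSegment. Assumption: an idle voice starts speaking at once.
    void QueueSegment(const std::string& text)
    {
        m_pending.push_back(text);
        if (!m_speaking)
        {
            m_speaking = true;
            m_index = 0;
            m_events.push_back("OnSpeechStarted");
            StartNext();
        }
    }

    // Simulates the synthesizer reaching the end of the current segment.
    void FinishCurrentSegment()
    {
        if (!m_speaking)
        {
            return;
        }
        m_events.push_back("OnSegmentFinished " + std::to_string(m_index));
        if (m_pending.empty())
        {
            // OnSpeechFinished only fires once every queued segment has played.
            m_speaking = false;
            m_events.push_back("OnSpeechFinished");
        }
        else
        {
            ++m_index;
            StartNext();
        }
    }

    // Mirrors ClearQueue: drops pending segments, current playback continues.
    void ClearQueue() { m_pending.clear(); }

    // Mirrors StopSpeaking: stops playback and clears the queue.
    void StopSpeaking()
    {
        m_pending.clear();
        m_speaking = false;
    }

    bool IsSpeaking() const { return m_speaking; }
    const std::vector<std::string>& Events() const { return m_events; }

private:
    void StartNext()
    {
        assert(!m_pending.empty());
        m_current = m_pending.front();
        m_pending.pop_front();
        m_events.push_back("OnSegmentStarted " + std::to_string(m_index));
    }

    std::deque<std::string> m_pending;
    std::vector<std::string> m_events;
    std::string m_current;
    bool m_speaking = false;
    int m_index = 0;
};
```

Queueing two segments and finishing each in turn yields one `OnSpeechStarted`, a started/finished pair per segment index, and a single `OnSpeechFinished` at the end.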

Data Types

KlattVoiceParams

Core voice synthesis parameters controlling the Klatt formant synthesizer output.

| Field | Value |
|---|---|
| TypeId | {8A9C7F3B-4E2D-4C1A-9B5E-6D8F9A2C1B4E} |

| Field | Type | Description |
|---|---|---|
| Base Frequency | float | Fundamental frequency (F0) in Hz. Controls the base pitch of the voice. |
| Speed | float | Speech rate multiplier. 1.0 is normal speed. |
| Declination | float | Pitch declination rate. Controls how pitch drops over the course of an utterance. |
| Waveform | KlattWaveform | Glottal waveform type used by the synthesizer. |
| Formant Shift | float | Shifts all formant frequencies up or down. Positive values raise pitch character, negative values lower it. |
| Pitch Variance | float | Amount of random pitch variation applied during speech for natural-sounding intonation. |

KlattVoiceProfile

A voice profile asset combining synthesis parameters with a phoneme mapping.

| Field | Value |
|---|---|
| TypeId | {2CEB777E-DAA7-40B1-BFF4-0F772ADE86CF} |
| Reflection | Requires GS_AssetReflectionIncludes.h (see Serialization Helpers) |

| Field | Type | Description |
|---|---|---|
| Voice Params | KlattVoiceParams | The synthesis parameters for this voice profile. |
| Phoneme Map | AZ::Data::Asset<KlattPhonemeMap> | The phoneme mapping asset used for text-to-phoneme conversion. |

KlattVoicePreset

A preset configuration for quick voice setup.

| Field | Value |
|---|---|
| TypeId | {2B8D9E4F-7C6A-4D3B-8E9F-1A2B3C4D5E6F} |

| Field | Type | Description |
|---|---|---|
| Preset Name | AZStd::string | Display name for this preset. |
| Profile | KlattVoiceProfile | The voice profile configuration stored in this preset. |

KlattSpatialConfig

3D spatial audio configuration for voice positioning.

| Field | Value |
|---|---|
| TypeId | {7C9F8E2D-3A4B-5F6C-1E0D-9A8B7C6D5E4F} |

| Field | Type | Description |
|---|---|---|
| Enable 3D | bool | Whether this voice uses 3D spatialization. When false, audio plays as 2D. |
| Min Distance | float | Distance at which attenuation begins. Below this distance the voice plays at full volume. |
| Max Distance | float | Distance at which the voice reaches minimum volume. |
| Attenuation Model | int | The distance attenuation curve type (linear, inverse, exponential). |
| Doppler Factor | float | Intensity of the Doppler effect applied to this voice. 0.0 disables Doppler. |
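
To make the three attenuation curves concrete, here is a minimal sketch using the widely used clamped distance-model formulas (the same shapes OpenAL defines). It is an illustration only: the reference does not say which integer selects which model or which exact formulas the audio backend applies, so the `AttenuationModel` enum, the `ComputeGain` function, and the `rolloff` parameter are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical enum; the real config stores the model as an int.
enum class AttenuationModel { Linear, Inverse, Exponential };

// Returns a gain in [0, 1] for a source `distance` units from the listener.
// Distances are clamped to [minDistance, maxDistance], so the voice plays at
// full volume inside minDistance and stops attenuating beyond maxDistance.
float ComputeGain(AttenuationModel model, float distance,
                  float minDistance, float maxDistance, float rolloff = 1.0f)
{
    assert(minDistance > 0.0f && maxDistance > minDistance);
    const float d = std::clamp(distance, minDistance, maxDistance);
    float gain = 1.0f;
    switch (model)
    {
    case AttenuationModel::Linear:
        // Straight line from 1.0 at minDistance down to 0.0 at maxDistance.
        gain = 1.0f - rolloff * (d - minDistance) / (maxDistance - minDistance);
        break;
    case AttenuationModel::Inverse:
        // Halves the gain each time the distance past minDistance doubles it.
        gain = minDistance / (minDistance + rolloff * (d - minDistance));
        break;
    case AttenuationModel::Exponential:
        // Power-law falloff relative to minDistance.
        gain = std::pow(d / minDistance, -rolloff);
        break;
    }
    return std::clamp(gain, 0.0f, 1.0f);
}
```

Note that under these formulas only the linear curve reaches silence at Max Distance; the inverse and exponential curves level off at a small non-zero gain, which matches the "minimum volume" wording above.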

KlattPhonemeMap

Phoneme mapping asset for text-to-ARPABET conversion with custom overrides.

| Field | Value |
|---|---|
| TypeId | {F3E9D7C1-2A4B-5E8F-9C3D-6A1B4E7F2D5C} |
| Reflection | Requires GS_AssetReflectionIncludes.h (see Serialization Helpers) |

| Field | Type | Description |
|---|---|---|
| Base Map | BasePhonemeMap | The base phoneme dictionary to use as the foundation for conversion. |
| Overrides | AZStd::vector<PhonemeOverride> | Custom pronunciation overrides for specific words or patterns. |

PhonemeOverride

A custom pronunciation rule that overrides the base phoneme map for a specific word or pattern.

| Field | Value |
|---|---|
| TypeId | {A2B5C8D1-4E7F-3A9C-6B2D-1F5E8A3C7D9B} |

| Field | Type | Description |
|---|---|---|
| Word | AZStd::string | The word or pattern to match. |
| Phonemes | AZStd::string | The ARPABET phoneme sequence to use for this word. |
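
The override-then-base lookup order can be sketched as follows. This is a simplified model rather than the gem's code: `PhonemeDictionary` and `LookUpPhonemes` are hypothetical names, matching is assumed to be whole-word and case-insensitive, and pattern matching is not modelled. The ARPABET strings in the usage note are illustrative CMU-style entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical dictionary type: lower-cased word -> ARPABET phoneme sequence.
using PhonemeDictionary = std::unordered_map<std::string, std::string>;

// Lower-cases a word so lookups ignore case (an assumption of this sketch).
std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Checks the custom overrides first, then falls back to the base map.
// Returns std::nullopt when neither dictionary knows the word.
std::optional<std::string> LookUpPhonemes(const std::string& word,
                                          const PhonemeDictionary& overrides,
                                          const PhonemeDictionary& baseMap)
{
    const std::string key = ToLower(word);
    if (const auto it = overrides.find(key); it != overrides.end())
    {
        return it->second;
    }
    if (const auto it = baseMap.find(key); it != baseMap.end())
    {
        return it->second;
    }
    return std::nullopt;
}
```

With a base entry of `tomato -> T AH0 M EY1 T OW2` and an override of `tomato -> T AH0 M AA1 T OW2`, looking up "Tomato" returns the override, while a word present only in the base map returns the base pronunciation.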

Enumerations

KlattWaveform

Glottal waveform types available for the Klatt synthesizer.

| Field | Value |
|---|---|
| TypeId | {8ED1DABE-3347-44A5-B43A-C171D36AE780} |

| Value | Description |
|---|---|
| Saw | Sawtooth waveform. Bright, buzzy character. |
| Triangle | Triangle waveform. Softer than sawtooth, slightly hollow. |
| Sin | Sine waveform. Pure tone, smooth and clean. |
| Square | Square waveform. Hollow, reed-like character. |
| Pulse | Pulse waveform. Variable duty cycle for varied timbres. |
| Noise | Noise waveform. Breathy, whisper-like quality. |
| Warble | Warble waveform. Modulated tone with vibrato-like character. |

BasePhonemeMap

Available base phoneme dictionaries for text-to-ARPABET conversion.

| Field | Value |
|---|---|
| TypeId | {D8F2A3C5-1B4E-7A9F-6D2C-5E8A1B3F4C7D} |

| Value | Description |
|---|---|
| SoLoud_Default | The default phoneme mapping built into SoLoud. Covers standard English pronunciation. |
| CMU_Full | The full CMU Pronouncing Dictionary. Comprehensive English phoneme coverage with over 130,000 entries. |

KTT Voice Tags

KTT (Klatt Text Tags) are inline commands embedded in strings passed to KlattVoiceComponent::SpeakText. They are parsed by KlattCommandParser::Parse and stripped from the spoken text before synthesis begins — they are never heard.

Format: <ktt attr1=value1 attr2=value2>

Multiple attributes can be combined in a single tag. Attribute names are case-insensitive. String values may optionally be wrapped in quotes. An empty value (e.g. speed=) resets that parameter to the voice profile default.
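
The tag rules above can be illustrated with a small standalone parser. This is a sketch under stated assumptions, not the actual KlattCommandParser::Parse: the names `KttCommand`, `KttParseResult`, `ParseAttributes`, and `ParseKtt` are hypothetical, the tag name is matched as lower-case `ktt` only, and a `>` inside a quoted value is not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct KttCommand
{
    std::size_t textOffset;                    // offset into the stripped text
    std::map<std::string, std::string> attrs;  // lower-cased name -> value
};

struct KttParseResult
{
    std::string spokenText;            // input with all <ktt ...> tags removed
    std::vector<KttCommand> commands;  // one entry per tag, in order
};

// Parses "name=value" pairs. Names are lower-cased, quotes around values are
// optional, and an empty value is kept as "" (meaning: reset to default).
std::map<std::string, std::string> ParseAttributes(const std::string& body)
{
    const auto isSpace = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };
    std::map<std::string, std::string> attrs;
    std::size_t i = 0;
    while (i < body.size())
    {
        while (i < body.size() && isSpace(body[i])) ++i;
        std::string name;
        while (i < body.size() && body[i] != '=' && !isSpace(body[i]))
            name += static_cast<char>(
                std::tolower(static_cast<unsigned char>(body[i++])));
        if (i >= body.size() || body[i] != '=')
            continue;  // a bare word without '=' is ignored in this sketch
        ++i;           // skip '='
        std::string value;
        if (i < body.size() && body[i] == '"')
        {
            ++i;  // opening quote
            while (i < body.size() && body[i] != '"') value += body[i++];
            if (i < body.size()) ++i;  // closing quote
        }
        else
        {
            while (i < body.size() && !isSpace(body[i])) value += body[i++];
        }
        if (!name.empty()) attrs[name] = value;
    }
    return attrs;
}

// Strips every <ktt ...> tag and records where each one applied.
KttParseResult ParseKtt(const std::string& input)
{
    KttParseResult result;
    std::size_t pos = 0;
    while (pos < input.size())
    {
        const std::size_t open = input.find("<ktt", pos);
        if (open == std::string::npos)
        {
            result.spokenText += input.substr(pos);
            break;
        }
        const std::size_t afterName = open + 4;
        const std::size_t close = input.find('>', open);
        const bool looksLikeTag = close != std::string::npos &&
            (afterName == close ||
             std::isspace(static_cast<unsigned char>(input[afterName])));
        if (!looksLikeTag)
        {
            // Not a complete tag: keep the characters as spoken text.
            result.spokenText += input.substr(pos, afterName - pos);
            pos = afterName;
            continue;
        }
        assert(close >= afterName);
        result.spokenText += input.substr(pos, open - pos);
        result.commands.push_back(KttCommand{
            result.spokenText.size(),
            ParseAttributes(input.substr(afterName, close - afterName))});
        pos = close + 1;
    }
    return result;
}
```

Recording an offset per tag is one simple way to apply each parameter change "from this point forward"; the real parser may represent positions differently.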


speed=X

Override the speech speed multiplier from this point forward.

| | |
|---|---|
| Range | 0.1 to 5.0 |
| Default reset | speed= (restores profile default) |
| 1.0 | Normal speed |

Normal speech <ktt speed=2.0> fast bit <ktt speed=> back to default.

decl=X / declination=X

Pitch declination — how much pitch falls over the course of the utterance. Both decl and declination are accepted.

| | |
|---|---|
| Range | 0.0 to 1.0 |
| 0.0 | Steady pitch (no fall) |
| 0.8 | Strong downward drift |

Rising <ktt decl=0.0> steady <ktt decl=0.8> falling voice.

waveform="TYPE"

Change the glottal waveform used by the synthesizer, setting the overall character of the voice.

| Value | Character |
|---|---|
| saw | Default, neutral voice |
| triangle | Softer, smoother |
| sin / sine | Pure tone, robotic |
| square | Harsh, mechanical |
| pulse | Raspy, textured |
| noise | Whispered, breathy |
| warble | Wobbly, character voice |

<ktt waveform="noise"> whispered section <ktt waveform="saw"> normal voice.

vowel=X

First formant (F1) frequency multiplier. Shifts the quality of synthesised vowel sounds.

| | |
|---|---|
| 1.0 | Normal |
| > 1.0 | More open vowel quality |
| < 1.0 | More closed vowel quality |

<ktt vowel=1.4> different vowel colour here.

accent=X

Second formant (F2) frequency multiplier. Shifts accent or dialect colouration.

| | |
|---|---|
| 1.0 | Normal |
| < 1.0 | Shifted accent colouring |

<ktt accent=0.8> shifted accent here.

pitch=X

F0 pitch variance amount. Controls how much pitch varies during synthesis.

| | |
|---|---|
| 1.0 | Normal variance |
| > 1.0 | More expressive intonation |
| < 1.0 | Flatter, more monotone |

<ktt pitch=2.0> very expressive speech <ktt pitch=0.1> flat monotone.

pause=X

Insert a pause of X seconds at this position in the voice playback. Value is required — there is no default.

Hello.<ktt pause=0.8> How are you?

Combined Example

Dialogue string using typewriter text commands and KTT voice tags together:

[b]Warning:[/b] [color=#FF0000]do not[/color] proceed.[pause=1]
<ktt waveform="square" pitch=1.8>This is a mechanical override.<ktt pause=0.5><ktt waveform="saw" pitch=1.0>
[speed=3]Resuming normal protocol.[/speed]

See Also

For component references:

  • Audio Manager – Manager lifecycle that the voice system participates in
