Klatt Voice Synthesis
The Klatt Voice Synthesis system provides custom text-to-speech for GS_Play projects using Klatt formant synthesis with full 3D spatial audio. It uses SoLoud internally for speech generation and MiniAudio for spatial positioning.
The system has two layers:
- KlattVoiceSystemComponent – A singleton that manages the shared SoLoud engine instance and tracks the 3D audio listener position.
- KlattVoiceComponent – A per-entity component that generates speech, queues segments, applies voice profiles, and emits spatialized audio from the entity’s position.
Voice characteristics are defined through KlattVoiceProfile assets containing frequency, speed, waveform, formant, and phoneme mapping configuration. Phoneme maps convert input text to ARPABET phonemes for the Klatt synthesizer, with support for custom pronunciation overrides.
For usage guides and setup examples, see The Basics: GS_Audio.

Components
KlattVoiceSystemComponent
Singleton component that manages the shared SoLoud engine and 3D listener tracking.
| Field | Value |
|---|---|
| TypeId | {F4A5D6E7-8B9C-4D5E-A1F2-3B4C5D6E7F8A} |
| Extends | AZ::Component, AZ::TickBus::Handler |
| Bus | KlattVoiceSystemRequestBus (Single/Single) |
KlattVoiceComponent
Per-entity voice component with spatial audio, phoneme mapping, and segment queue.
| Field | Value |
|---|---|
| TypeId | {4A8B9C7D-6E5F-4D3C-2B1A-0F9E8D7C6B5A} |
| Extends | AZ::Component, AZ::TickBus::Handler |
| Request Bus | KlattVoiceRequestBus (Single/ById, entity-addressed) |
| Notification Bus | KlattVoiceNotificationBus (Multiple/Multiple) |
API Reference
Request Bus: KlattVoiceSystemRequestBus
System-level voice management. Singleton bus – Single address, single handler.
| Method | Parameters | Returns | Description |
|---|---|---|---|
| GetSoLoudEngine | – | SoLoud::Soloud* | Returns a pointer to the shared SoLoud engine instance. |
| SetListenerPosition | const AZ::Vector3& position | void | Updates the 3D audio listener position for spatial voice playback. |
| SetListenerOrientation | const AZ::Vector3& forward, const AZ::Vector3& up | void | Updates the 3D audio listener orientation. |
| GetListenerPosition | – | AZ::Vector3 | Returns the current listener position. |
| IsEngineReady | – | bool | Returns whether the SoLoud engine has been initialized and is ready. |
Request Bus: KlattVoiceRequestBus
Per-entity voice synthesis controls. Entity-addressed bus – Single handler per entity ID.
| Method | Parameters | Returns | Description |
|---|---|---|---|
| Speak | const AZStd::string& text | void | Converts text to speech and plays it. Uses the component’s configured voice profile. |
| SpeakWithParams | const AZStd::string& text, const KlattVoiceParams& params | void | Converts text to speech using the specified voice parameters instead of the profile defaults. |
| StopSpeaking | – | void | Immediately stops any speech in progress and clears the segment queue. |
| IsSpeaking | – | bool | Returns whether this entity’s voice is currently producing speech. |
| QueueSegment | const AZStd::string& text | void | Adds a speech segment to the queue. Queued segments play in order after the current segment finishes. |
| ClearQueue | – | void | Clears all queued speech segments without stopping current playback. |
| SetVoiceProfile | const AZ::Data::Asset<KlattVoiceProfile>& profile | void | Changes the voice profile used by this component. |
| GetVoiceProfile | – | AZ::Data::Asset<KlattVoiceProfile> | Returns the currently assigned voice profile asset. |
| SetSpatialConfig | const KlattSpatialConfig& config | void | Updates the 3D spatial audio configuration for this voice. |
| GetSpatialConfig | – | KlattSpatialConfig | Returns the current spatial audio configuration. |
| SetVolume | float volume | void | Sets the output volume for this voice (0.0 to 1.0). |
| GetVolume | – | float | Returns the current output volume. |
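
The difference between `ClearQueue` and `StopSpeaking` is easiest to see in a small model. The sketch below is a standalone stand-in, not the gem's component: `VoiceQueueModel` and `FinishCurrentSegment` are hypothetical names, and the choice that `Speak` replaces the current segment while leaving the queue untouched is an assumption the reference does not confirm.

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <string>
#include <utility>

// Hypothetical model of the per-entity queue semantics described in the table above.
class VoiceQueueModel
{
public:
    // Assumption: Speak replaces the segment currently playing and leaves the queue as-is.
    void Speak(std::string text) { m_current = std::move(text); }

    // Queued segments wait until the current segment finishes.
    void QueueSegment(std::string text) { m_pending.push_back(std::move(text)); }

    // Drops queued segments only; current playback continues.
    void ClearQueue() { m_pending.clear(); }

    // Stops current playback and drops the queue.
    void StopSpeaking()
    {
        m_current.reset();
        m_pending.clear();
    }

    bool IsSpeaking() const { return m_current.has_value(); }
    std::size_t PendingCount() const { return m_pending.size(); }
    const std::optional<std::string>& Current() const { return m_current; }

    // Simulates the current segment reaching its end and promotes the next queued one.
    void FinishCurrentSegment()
    {
        if (m_pending.empty())
        {
            m_current.reset();
            return;
        }
        m_current = std::move(m_pending.front());
        m_pending.pop_front();
    }

private:
    std::optional<std::string> m_current;
    std::deque<std::string> m_pending;
};
```

In this model, calling `ClearQueue` mid-utterance leaves `IsSpeaking()` true until the current segment ends, whereas `StopSpeaking` makes it false immediately.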
Notification Bus: KlattVoiceNotificationBus
Events broadcast by voice components. Multiple handler bus – any number of components can subscribe.
| Event | Parameters | Description |
|---|---|---|
| OnSpeechStarted | const AZ::EntityId& entityId | Fired when an entity begins speaking. |
| OnSpeechFinished | const AZ::EntityId& entityId | Fired when an entity finishes speaking (including all queued segments). |
| OnSegmentStarted | const AZ::EntityId& entityId, int segmentIndex | Fired when a new speech segment begins playing. |
| OnSegmentFinished | const AZ::EntityId& entityId, int segmentIndex | Fired when a speech segment finishes playing. |
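
Because every event carries the speaking entity's ID, a handler that hears events from many voices has to key its state by entity. The following is a minimal standalone sketch of that bookkeeping, assuming nothing about the real EBus handler class: `SpeakingTracker` is a hypothetical name and `EntityIdStandIn` is a plain integer used in place of `AZ::EntityId`.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Placeholder for AZ::EntityId so the sketch compiles without the engine.
using EntityIdStandIn = std::uint64_t;

// Hypothetical listener that records which entities are currently speaking.
class SpeakingTracker
{
public:
    // Mirrors OnSpeechStarted: the entity has begun speaking.
    void OnSpeechStarted(EntityIdStandIn entityId) { m_speaking.insert(entityId); }

    // Mirrors OnSpeechFinished: the entity has finished, including queued segments.
    void OnSpeechFinished(EntityIdStandIn entityId) { m_speaking.erase(entityId); }

    bool IsSpeaking(EntityIdStandIn entityId) const { return m_speaking.count(entityId) != 0; }
    std::size_t ActiveSpeakerCount() const { return m_speaking.size(); }

private:
    std::unordered_set<EntityIdStandIn> m_speaking;
};
```

In a real project the two methods would be overrides on a `KlattVoiceNotificationBus` handler; the tracking logic itself would stay the same.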
Data Types
KlattVoiceParams
Core voice synthesis parameters controlling the Klatt formant synthesizer output.
| Field | Value |
|---|---|
| TypeId | {8A9C7F3B-4E2D-4C1A-9B5E-6D8F9A2C1B4E} |

| Field | Type | Description |
|---|---|---|
| Base Frequency | float | Fundamental frequency (F0) in Hz. Controls the base pitch of the voice. |
| Speed | float | Speech rate multiplier. 1.0 is normal speed. |
| Declination | float | Pitch declination rate. Controls how pitch drops over the course of an utterance. |
| Waveform | KlattWaveform | Glottal waveform type used by the synthesizer. |
| Formant Shift | float | Shifts all formant frequencies up or down. Positive values raise pitch character, negative values lower it. |
| Pitch Variance | float | Amount of random pitch variation applied during speech for natural-sounding intonation. |
KlattVoiceProfile
A voice profile asset combining synthesis parameters with a phoneme mapping.
| Field | Value |
|---|---|
| TypeId | {2CEB777E-DAA7-40B1-BFF4-0F772ADE86CF} |
| Reflection | Requires GS_AssetReflectionIncludes.h (see Serialization Helpers) |

| Field | Type | Description |
|---|---|---|
| Voice Params | KlattVoiceParams | The synthesis parameters for this voice profile. |
| Phoneme Map | AZ::Data::Asset<KlattPhonemeMap> | The phoneme mapping asset used for text-to-phoneme conversion. |
KlattVoicePreset
A preset configuration for quick voice setup.
| Field | Value |
|---|---|
| TypeId | {2B8D9E4F-7C6A-4D3B-8E9F-1A2B3C4D5E6F} |

| Field | Type | Description |
|---|---|---|
| Preset Name | AZStd::string | Display name for this preset. |
| Profile | KlattVoiceProfile | The voice profile configuration stored in this preset. |
KlattSpatialConfig
3D spatial audio configuration for voice positioning.
| Field | Value |
|---|---|
| TypeId | {7C9F8E2D-3A4B-5F6C-1E0D-9A8B7C6D5E4F} |

| Field | Type | Description |
|---|---|---|
| Enable 3D | bool | Whether this voice uses 3D spatialization. When false, audio plays as 2D. |
| Min Distance | float | Distance at which attenuation begins. Below this distance the voice plays at full volume. |
| Max Distance | float | Distance at which the voice reaches minimum volume. |
| Attenuation Model | int | The distance attenuation curve type (linear, inverse, exponential). |
| Doppler Factor | float | Intensity of the Doppler effect applied to this voice. 0.0 disables Doppler. |
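
To make the three attenuation curves concrete, here is a sketch of the conventional OpenAL-style gain formulas for linear, inverse, and exponential falloff. This is an illustration under assumptions, not the gem's code: the `AttenuationModel` enum is a local stand-in (the reference does not say which integer selects which curve), `DistanceGain` is a hypothetical helper, and the `rolloff` parameter is not a documented field, so it defaults to 1.0.

```cpp
#include <algorithm>
#include <cmath>

// Local stand-in; the integer values used by the real Attenuation Model field are not documented here.
enum class AttenuationModel { Linear, Inverse, Exponential };

// Returns a gain for a source at `distance`.
// Requires 0 < minDistance < maxDistance. `rolloff` is an assumed extra parameter.
inline float DistanceGain(AttenuationModel model, float distance,
                          float minDistance, float maxDistance, float rolloff = 1.0f)
{
    // Inside Min Distance the gain is 1.0; past Max Distance it stops changing.
    const float d = std::clamp(distance, minDistance, maxDistance);
    switch (model)
    {
    case AttenuationModel::Linear:
        // Straight-line fade between minDistance and maxDistance, floored at 0.
        return std::max(0.0f, 1.0f - rolloff * (d - minDistance) / (maxDistance - minDistance));
    case AttenuationModel::Inverse:
        // Falls off quickly near the source, then flattens out.
        return minDistance / (minDistance + rolloff * (d - minDistance));
    case AttenuationModel::Exponential:
        // Power-law falloff relative to minDistance.
        return std::pow(d / minDistance, -rolloff);
    }
    return 1.0f;
}
```

With Min Distance 1 and Max Distance 3, a source 2 units away comes out at half gain under all three curves when `rolloff` is 1; the curves differ in how they approach that point and in what gain remains at Max Distance.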
KlattPhonemeMap
Phoneme mapping asset for text-to-ARPABET conversion with custom overrides.
| Field | Value |
|---|---|
| TypeId | {F3E9D7C1-2A4B-5E8F-9C3D-6A1B4E7F2D5C} |
| Reflection | Requires GS_AssetReflectionIncludes.h (see Serialization Helpers) |

| Field | Type | Description |
|---|---|---|
| Base Map | BasePhonemeMap | The base phoneme dictionary to use as the foundation for conversion. |
| Overrides | AZStd::vector<PhonemeOverride> | Custom pronunciation overrides for specific words or patterns. |
PhonemeOverride
A custom pronunciation rule that overrides the base phoneme map for a specific word or pattern.
| Field | Value |
|---|---|
| TypeId | {A2B5C8D1-4E7F-3A9C-6B2D-1F5E8A3C7D9B} |

| Field | Type | Description |
|---|---|---|
| Word | AZStd::string | The word or pattern to match. |
| Phonemes | AZStd::string | The ARPABET phoneme sequence to use for this word. |
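
The override mechanism amounts to a two-level lookup: check the custom overrides first, then fall back to the base dictionary. The sketch below shows that precedence with plain standard-library maps. `PhonemeTable` and `LookupPhonemes` are hypothetical names, the case-insensitive word match is an assumption, and pattern matching is not modelled at all.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Word (lowercase) -> ARPABET phoneme string. A stand-in for the real asset data.
using PhonemeTable = std::unordered_map<std::string, std::string>;

inline std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns the override pronunciation if one exists, otherwise the base entry,
// otherwise an empty string. Keys in both tables are expected to be lowercase.
inline std::string LookupPhonemes(const PhonemeTable& overrides, const PhonemeTable& base,
                                  const std::string& word)
{
    const std::string key = ToLower(word); // assumption: matching ignores case
    if (auto it = overrides.find(key); it != overrides.end())
    {
        return it->second;
    }
    if (auto it = base.find(key); it != base.end())
    {
        return it->second;
    }
    return {};
}
```

For example, with a base entry `tomato -> T AH0 M EY1 T OW2` and an override `tomato -> T AH0 M AA1 T OW2`, the lookup returns the override, while words without an override still resolve through the base table.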
Enumerations
KlattWaveform
Glottal waveform types available for the Klatt synthesizer.
| Field | Value |
|---|---|
| TypeId | {8ED1DABE-3347-44A5-B43A-C171D36AE780} |

| Value | Description |
|---|---|
| Saw | Sawtooth waveform. Bright, buzzy character. |
| Triangle | Triangle waveform. Softer than sawtooth, slightly hollow. |
| Sin | Sine waveform. Pure tone, smooth and clean. |
| Square | Square waveform. Hollow, reed-like character. |
| Pulse | Pulse waveform. Variable duty cycle for varied timbres. |
| Noise | Noise waveform. Breathy, whisper-like quality. |
| Warble | Warble waveform. Modulated tone with vibrato-like character. |
BasePhonemeMap
Available base phoneme dictionaries for text-to-ARPABET conversion.
| Field | Value |
|---|---|
| TypeId | {D8F2A3C5-1B4E-7A9F-6D2C-5E8A1B3F4C7D} |

| Value | Description |
|---|---|
| SoLoud_Default | The default phoneme mapping built into SoLoud. Covers standard English pronunciation. |
| CMU_Full | The full CMU Pronouncing Dictionary. Comprehensive English phoneme coverage with over 130,000 entries. |
KTT Voice Tags
KTT (Klatt Text Tags) are inline commands embedded in strings passed to KlattVoiceComponent::SpeakText. They are parsed by KlattCommandParser::Parse and stripped from the spoken text before synthesis begins — they are never heard.
Format: <ktt attr1=value1 attr2=value2>
Multiple attributes can be combined in a single tag. Attribute names are case-insensitive. String values may optionally be wrapped in quotes. An empty value (e.g. speed=) resets that parameter to the voice profile default.
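
As an illustration of the format rules above, here is a minimal standalone parser sketch. It is not `KlattCommandParser::Parse`: `ParseKtt`, `ParseAttributes`, and `KttParseResult` are hypothetical names, and the sketch assumes the tag opener is written in lowercase and that quoted values never contain `>`.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct KttParseResult
{
    std::string spokenText;                                // input with every <ktt ...> tag removed
    std::vector<std::map<std::string, std::string>> tags;  // one attribute map per tag, in order
};

// Parses the inside of one tag, e.g. ` speed=2.0 waveform="noise"`.
inline std::map<std::string, std::string> ParseAttributes(const std::string& body)
{
    auto isSpace = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
    std::map<std::string, std::string> attrs;
    std::size_t i = 0;
    while (i < body.size())
    {
        while (i < body.size() && isSpace(body[i])) ++i;
        std::string name; // attribute names are case-insensitive, so store them lowercase
        while (i < body.size() && body[i] != '=' && !isSpace(body[i]))
            name += static_cast<char>(std::tolower(static_cast<unsigned char>(body[i++])));
        if (name.empty()) break;
        std::string value; // stays empty for `speed=`, which means "reset to profile default"
        if (i < body.size() && body[i] == '=')
        {
            ++i;
            if (i < body.size() && body[i] == '"')
            {
                ++i; // optional quotes around string values
                while (i < body.size() && body[i] != '"') value += body[i++];
                if (i < body.size()) ++i;
            }
            else
            {
                while (i < body.size() && !isSpace(body[i])) value += body[i++];
            }
        }
        attrs[name] = value;
    }
    return attrs;
}

// Splits an input string into spoken text and the list of parsed tags.
inline KttParseResult ParseKtt(const std::string& input)
{
    KttParseResult result;
    std::size_t pos = 0;
    while (pos < input.size())
    {
        const std::size_t open = input.find("<ktt", pos);
        const std::size_t close = (open == std::string::npos) ? open : input.find('>', open);
        if (close == std::string::npos)
        {
            result.spokenText.append(input, pos, std::string::npos); // no further complete tag
            break;
        }
        result.spokenText.append(input, pos, open - pos);
        result.tags.push_back(ParseAttributes(input.substr(open + 4, close - open - 4)));
        pos = close + 1;
    }
    return result;
}
```

The sketch only collects attributes as strings. Interpreting them (ranges, waveform names, pause durations) would be a separate step, and a real parser would also need to record where in the text each tag applies.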
speed=X
Override the speech speed multiplier from this point forward.
| Range | 0.1 – 5.0 |
| Default reset | speed= (restores profile default) |
| 1.0 | Normal speed |
Normal speech <ktt speed=2.0> fast bit <ktt speed=> back to default.
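
The reset behaviour can be modelled as an optional override layered over the profile value. In the sketch below, `SpeedOverrideFromTag` and `EffectiveSpeed` are hypothetical helpers. Clamping out-of-range values to 0.1 – 5.0 is an assumption, since the reference states the range but not what happens outside it.

```cpp
#include <algorithm>
#include <optional>
#include <string>

// Documented range for the speed tag.
constexpr float kMinSpeed = 0.1f;
constexpr float kMaxSpeed = 5.0f;

// Converts the text after `speed=` into an override.
// An empty value yields no override, which stands for "restore the profile default".
// Note: std::stof throws on non-numeric input; a real parser would handle that case.
inline std::optional<float> SpeedOverrideFromTag(const std::string& value)
{
    if (value.empty())
    {
        return std::nullopt;
    }
    // Assumption: out-of-range values are clamped rather than rejected.
    return std::clamp(std::stof(value), kMinSpeed, kMaxSpeed);
}

// The speed actually used: the tag override if present, else the profile's speed.
inline float EffectiveSpeed(const std::optional<float>& tagOverride, float profileSpeed)
{
    return tagOverride.value_or(profileSpeed);
}
```

Walking through the example line with a profile speed of 1.25: the first tag produces an override of 2.0, and the empty `speed=` clears it so playback returns to 1.25.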
decl=X / declination=X
Pitch declination — how much pitch falls over the course of the utterance. Both decl and declination are accepted.
| Range | 0.0 – 1.0 |
| 0.0 | Steady pitch (no fall) |
| 0.8 | Strong downward drift |
Rising <ktt decl=0.0> steady <ktt decl=0.8> falling voice.
waveform="TYPE"
Change the glottal waveform used by the synthesizer, setting the overall character of the voice.
| Value | Character |
|---|---|
| saw | Default, neutral voice |
| triangle | Softer, smoother |
| sin / sine | Pure tone, robotic |
| square | Harsh, mechanical |
| pulse | Raspy, textured |
| noise | Whispered, breathy |
| warble | Wobbly, character voice |
<ktt waveform="noise"> whispered section <ktt waveform="saw"> normal voice.
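
Mapping the tag value to a waveform is a plain string lookup, with `sin` and `sine` as aliases for the same entry. The sketch below uses a local `KlattWaveform` enum that only mirrors the documented value names, so its numeric values are not meaningful. `WaveformFromTag` is a hypothetical helper, and treating the value as case-insensitive is an assumption (the reference only states that attribute names are).

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Local stand-in that mirrors the documented value names only.
enum class KlattWaveform { Saw, Triangle, Sin, Square, Pulse, Noise, Warble };

// Returns the waveform named by a tag value, or nullopt for an unrecognised name.
inline std::optional<KlattWaveform> WaveformFromTag(std::string name)
{
    // Assumption: values are matched without regard to case.
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    if (name == "saw") return KlattWaveform::Saw;
    if (name == "triangle") return KlattWaveform::Triangle;
    if (name == "sin" || name == "sine") return KlattWaveform::Sin; // both spellings accepted
    if (name == "square") return KlattWaveform::Square;
    if (name == "pulse") return KlattWaveform::Pulse;
    if (name == "noise") return KlattWaveform::Noise;
    if (name == "warble") return KlattWaveform::Warble;
    return std::nullopt;
}
```

Returning `std::nullopt` for unknown names leaves the caller free to decide whether to keep the current waveform or fall back to the profile default.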
vowel=X
First formant (F1) frequency multiplier. Shifts the quality of synthesised vowel sounds.
| 1.0 | Normal |
| > 1.0 | More open vowel quality |
| < 1.0 | More closed vowel quality |
<ktt vowel=1.4> different vowel colour here.
accent=X
Second formant (F2) frequency multiplier. Shifts accent or dialect colouration.
| 1.0 | Normal |
| < 1.0 | Shifted accent colouring |
<ktt accent=0.8> shifted accent here.
pitch=X
F0 pitch variance amount. Controls how much pitch varies during synthesis.
| 1.0 | Normal variance |
| > 1.0 | More expressive intonation |
| < 1.0 | Flatter, more monotone |
<ktt pitch=2.0> very expressive speech <ktt pitch=0.1> flat monotone.
pause=X
Insert a pause of X seconds at this position in the voice playback. Value is required — there is no default.
Hello.<ktt pause=0.8> How are you?
Combined Example
Dialogue string using typewriter text commands and KTT voice tags together:
[b]Warning:[/b] [color=#FF0000]do not[/color] proceed.[pause=1]
<ktt waveform="square" pitch=1.8>This is a mechanical override.<ktt pause=0.5><ktt waveform="saw" pitch=1.0>
[speed=3]Resuming normal protocol.[/speed]
See Also
For conceptual overviews and usage guides:
- GS_Audio Overview – Top-level audio gem documentation
- The Basics: GS_Audio – Scripting-level usage guide
For component references:
- Audio Manager – Manager lifecycle that the voice system participates in