VOICE INTELLIGENCE · BUILT FOR CARE

Voice that registers how someone feels, not only what they say.

CoSpeak hears the signal beneath the words: pitch, cadence, the texture of a voice. It infers emotional state in real time and answers in a tone that fits the moment. English and Spanish, built for clinical settings.

Spectral readout · LIVE · 16 kHz
Spectrogram: frequency (0 Hz to 4 kHz) over time
F0 pitch: 174 Hz
Rate: 3.4 syl/s
Jitter: 0.8%
HNR: 18 dB
Inferred affect: calm
Measured, unhurried. Mirrors the settled pace and keeps turns open.
01 / THE SENSE-RESPOND LOOP

One closed loop, from the sound of a voice to a voice that answers.

Most voice systems treat speech as a transcript to be parsed. CoSpeak treats it as a signal carrying emotional information the words alone discard. Four stages run through every turn.
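As a rough illustration of how the four stages chain within a single turn, here is a minimal Python sketch. Everything in it is hypothetical: the `Turn` record, the `run_turn` function, and the stub stages are placeholders for illustration, not CoSpeak's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

Affect = Tuple[float, float]  # (valence, arousal); hypothetical representation


@dataclass
class Turn:
    features: Dict[str, float]
    affect: Affect
    reply_text: str
    reply_audio: bytes


def run_turn(
    audio: Sequence[float],
    listen: Callable[[Sequence[float]], Dict[str, float]],
    infer: Callable[[Dict[str, float]], Affect],
    respond: Callable[[Affect], str],
    speak: Callable[[str, Affect], bytes],
) -> Turn:
    """Run one pass of the loop: listen, infer, respond, speak."""
    features = listen(audio)                 # /01 acoustic features from the waveform
    affect = infer(features)                 # /02 point on the valence/arousal plane
    reply_text = respond(affect)             # /03 wording shaped by the inferred state
    reply_audio = speak(reply_text, affect)  # /04 voiced with matching prosody
    return Turn(features, affect, reply_text, reply_audio)


# Stub stages, for illustration only. Real stages would be models, not one-liners.
def listen_stub(audio):
    return {"energy": sum(x * x for x in audio) / max(len(audio), 1)}


def infer_stub(features):
    return (0.0, min(features["energy"], 1.0))


def respond_stub(affect):
    return "I'm here. Take your time." if affect[1] > 0.5 else "Go on, I'm listening."


def speak_stub(text, affect):
    return text.encode("utf-8")  # stands in for synthesized audio


turn = run_turn([0.1, -0.2, 0.1], listen_stub, infer_stub, respond_stub, speak_stub)
```

Passing the stages in as callables keeps each one swappable, which is one simple way to let several products share the same loop.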

/01 Listen

Acoustic features

Paralinguistic cues pulled from the raw waveform, independent of the words. Pitch contour, speaking rate, voice quality, spectral shape.

eGeMAPS · openSMILE · F0 / jitter
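To make two of these cues concrete, the sketch below estimates F0 from the autocorrelation peak of a frame and computes local jitter from a list of pitch periods. It is a textbook-style illustration in plain Python, not the openSMILE or eGeMAPS extraction pipeline; the function names, the 75 to 400 Hz search range, and the synthetic test tone are all assumptions.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate pitch in Hz for one voiced frame from its autocorrelation peak."""
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods, over the mean period."""
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))


# Synthetic check: a pure 174 Hz tone sampled at 16 kHz (64 ms frame).
SAMPLE_RATE = 16_000
tone = [math.sin(2 * math.pi * 174.0 * n / SAMPLE_RATE) for n in range(1024)]
f0 = estimate_f0(tone, SAMPLE_RATE)
```

On real speech the frame would first need voicing detection and windowing, and the resolution of this estimator is limited to whole-sample lags, so production extractors use more robust methods.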
/02 Infer

Affect inference

Self-supervised speech models place the utterance on a continuous valence and arousal plane, not one of a few rigid labels.

wav2vec 2.0 · HuBERT · WavLM
/03 Respond

Affect-aware dialogue

The reply is shaped by the inferred state. The system decides when to mirror, when to complement, and when to validate before offering anything else.

EmpatheticDialogues · ESConv
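One way to picture the mirror, complement, or validate decision is as an explicit policy over the inferred state. The sketch below is a hypothetical rule-based version: the `Strategy` names, the thresholds, and the assumption that valence and arousal both lie in [-1, 1] are illustrative, and the real decision may well be learned rather than hand-written.

```python
from enum import Enum


class Strategy(Enum):
    MIRROR = "mirror"                  # match the speaker's pace and tone
    COMPLEMENT = "complement"          # answer activation with steadiness
    VALIDATE_FIRST = "validate_first"  # acknowledge the feeling before anything else


def choose_strategy(valence, arousal, low_valence=-0.3, high_arousal=0.5):
    """Pick a response strategy from a point on the valence/arousal plane.

    Thresholds are placeholders, not tuned values.
    """
    if valence <= low_valence and arousal >= high_arousal:
        return Strategy.COMPLEMENT
    if valence <= low_valence:
        return Strategy.VALIDATE_FIRST
    return Strategy.MIRROR


calm = choose_strategy(valence=0.4, arousal=-0.2)       # settled speaker
distressed = choose_strategy(valence=-0.6, arousal=0.8)  # upset and activated
low = choose_strategy(valence=-0.6, arousal=0.0)         # upset but not activated
```

Writing the choice as a function with a named return value is what makes it an explicit decision that can be logged and reviewed, rather than a side effect of generation.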
/04 Speak

Emotional TTS

The response is voiced with prosody that fits the register. Pace, warmth, and pitch are set deliberately, never flattened into one neutral tone.

controllable prosody · 2 languages
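As an illustration of setting pace and pitch deliberately, the sketch below wraps a reply in the standard SSML `<prosody>` element, which accepts a percentage `rate` and a semitone `pitch` offset. Whether CoSpeak's TTS takes SSML at all is an assumption; `calming_prosody`, `to_ssml`, and the scaling constants are hypothetical, and individual TTS engines differ in which SSML attributes they honor.

```python
from xml.sax.saxutils import escape


def calming_prosody(arousal):
    """Map the speaker's arousal (clamped to 0..1) to a slower, lower reply.

    The 15% and 2-semitone scaling constants are illustrative placeholders.
    """
    a = max(0.0, min(1.0, arousal))
    rate_pct = round(100 - 15 * a)  # up to 15% slower
    pitch_st = -2.0 * a             # up to 2 semitones lower
    return rate_pct, pitch_st


def to_ssml(text, rate_pct=100, pitch_st=0.0):
    """Wrap reply text in an SSML prosody element, escaping XML special characters."""
    return (
        f'<speak><prosody rate="{int(rate_pct)}%" pitch="{pitch_st:+.1f}st">'
        f"{escape(text)}</prosody></speak>"
    )


rate, pitch = calming_prosody(arousal=1.0)
ssml = to_ssml("Take your time & breathe.", rate, pitch)
```

Keeping the prosody values as plain numbers until the last step makes it easy to swap SSML for an engine-specific control interface.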
PILOT FINDINGS · COSPEAK & EMPATHIA · REAL-WORLD HEALTHCARE SETTINGS
95%
Speech recognition accuracy across English and Spanish
88%
Emotion recognition accuracy on evaluation set
90%
Patient satisfaction reported in pilot intake
Roughly a third
Reduction in patient intake time

Figures are from controlled pilot deployments and an evaluation set, not population-level claims. Accuracy varies by language, recording conditions, and clinical context. Reported here for transparency, with full methodology available on request.

02 / THE PLATFORM

Four systems on one acoustic foundation.

Each product applies the same sense-respond loop to a different communication problem. The shared layer reads the voice; the difference is what each one does with what it hears.

Conversational agent

CoSpeak

Bilingual voice-to-voice interaction

The core agent. Real-time recognition, healthcare-domain understanding, and emotionally adaptive spoken replies in English and Spanish. It answers from uploaded guidelines and patient context, in the register the moment calls for.

Patient intake

Empathia

Empathetic, adaptive intake system

Voice and video analytics applied to first contact. Empathia reads emotional state during intake and adapts its questioning. In pilot, it cut intake time by roughly a third while raising reported patient satisfaction.

Clinical documentation

Scribe

Audio to compliant record, any practice

A template-driven scribe that turns a recorded encounter into a structured, compliance-ready document. Practice-specific formats, terminology recognition, an audit trail on every edit.

platform.cospeak.ai →
Emotional support · non-diagnostic

Companion

Voice support for moments of stress

A conversational companion a person can talk to during stress or in ordinary conversation. It understands emotional state through voice and responds with appropriate support. By design, it does not diagnose, label, or screen for any condition.

03 / THE RESEARCH

The science under the product.

The loop is built on a defined stack at each stage, drawn from peer-reviewed speech and affective-computing work and validated in our own deployments.

Inbound · reading the voice

Affective information lives in pitch, cadence, voice quality, and spectral shape long before it reaches the words. We extract a defined acoustic feature set and pass it to self-supervised speech representations rather than relying on transcripts alone.

eGeMAPS · ComParE · openSMILE · wav2vec 2.0 · HuBERT · WavLM

Outbound · shaping the reply

Response generation is grounded in empathetic-dialogue and emotional-support corpora, then voiced through controllable emotional text-to-speech so prosody carries the same intent as the wording. Mirror or complement is an explicit decision, not an accident of sampling.

EmpatheticDialogues · ESConv · emotional TTS · audio-native SLM

Dimensional, not categorical

Emotion is modeled on a continuous valence and arousal plane rather than sorted into a handful of fixed labels. This holds up better across languages, where the same felt state surfaces through different acoustic patterns.

valence / arousal · cross-lingual
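A small sketch of what the continuous representation preserves. The `AffectPoint` class, its [-1, 1] ranges, and the example coordinates are assumptions for illustration; the point is only that two utterances that would share one coarse label can sit far apart on the plane.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AffectPoint:
    valence: float  # assumed range: -1 (negative) to 1 (positive)
    arousal: float  # assumed range: -1 (calm) to 1 (activated)

    def quadrant(self):
        """Coarse label of the kind a categorical model would output."""
        v = "positive" if self.valence >= 0 else "negative"
        a = "high arousal" if self.arousal >= 0 else "low arousal"
        return f"{v} / {a}"

    def distance(self, other):
        """Euclidean distance on the plane."""
        return math.hypot(self.valence - other.valence, self.arousal - other.arousal)


mildly_tense = AffectPoint(valence=-0.1, arousal=0.1)
highly_distressed = AffectPoint(valence=-0.9, arousal=0.9)

same_label = mildly_tense.quadrant() == highly_distressed.quadrant()
gap = mildly_tense.distance(highly_distressed)
```

Here both points collapse to the same quadrant label, while the distance between them stays available to whatever decides how to respond.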

Where the field is moving

Real-time on-device affect sensing, multilingual emotion recognition, longitudinal tracking, and tight coupling with large language models. We track this landscape closely, including the line the EU AI Act draws around emotion recognition.

on-device · longitudinal · EU AI Act
Empathetic Multilingual Voice Agents Powered by Generative AI in Healthcare: Development and Implementation of CoSpeak and Empathia
Bora, S. · pilot study · ASR 95% · emotion recognition 88% · patient satisfaction 90%
Request the paper
04 / LANGUAGES

Two languages, each read on its own terms.

English and Spanish, with emotion inference tuned to the prosody of each. The model is not ported from one to the other; the acoustic conventions differ, so the reading does too.

English
en · US / global
96%
ASR accuracy

Tuned on clinical and intake speech across accents. Stress-timed rhythm means arousal shows up first in tempo and loudness, so the reading weights rate and energy dynamics heavily.

Prosodic cue
Falling pitch + slowing rate → settling, de-escalation.
Español
es · LatAm / ES
95%
ASR accuracy

Trained for the wider pitch range and syllable-timed rhythm of spoken Spanish. The same arousal reads differently, so pitch contour and vowel duration are weighted to avoid mistaking expressiveness for distress.

Prosodic cue
Tono sostenido + ritmo rápido → activación, ansiedad (sustained pitch + fast rate → activation, anxiety).

Cross-lingual emotion recognition is an open research problem precisely because expressiveness is cultural. Treating each language on its own acoustic terms is the difference between a system that reads feeling and one that mislabels a lively speaker as agitated.
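One simple way to read each language on its own terms is to score features against a per-language baseline before weighting them. The sketch below does that with z-scores. Every number in it is a made-up placeholder, the two features are a simplification of the cues named above, and `arousal_index` is a hypothetical name, not a CoSpeak function.

```python
# Hypothetical per-language baselines: feature -> (mean, standard deviation).
BASELINES = {
    "en": {"pitch_range_st": (6.0, 2.0), "rate_syl_s": (4.0, 0.8)},
    "es": {"pitch_range_st": (9.0, 2.5), "rate_syl_s": (5.0, 1.0)},
}

# Hypothetical per-language weights; each set sums to 1.
WEIGHTS = {
    "en": {"pitch_range_st": 0.3, "rate_syl_s": 0.7},
    "es": {"pitch_range_st": 0.6, "rate_syl_s": 0.4},
}


def z_scores(lang, features):
    """Express each feature as deviations from that language's own baseline."""
    return {
        name: (value - BASELINES[lang][name][0]) / BASELINES[lang][name][1]
        for name, value in features.items()
    }


def arousal_index(lang, features):
    """Weighted sum of language-normalized features (illustrative only)."""
    z = z_scores(lang, features)
    return sum(WEIGHTS[lang][name] * z[name] for name in z)


# The same raw measurements, read against each language's baseline.
lively = {"pitch_range_st": 10.0, "rate_syl_s": 5.0}
index_en = arousal_index("en", lively)
index_es = arousal_index("es", lively)
```

With these placeholder baselines, a wide pitch range and brisk rate score as clearly elevated against the English norm but close to typical against the Spanish one, which is the mislabeling the per-language approach is meant to avoid.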

POSITION · NON-DIAGNOSTIC BY DESIGN

We read emotion to respond to it, not to diagnose.

CoSpeak interprets emotional state the way a person registers tone in a conversation, and uses that to respond appropriately. It does not identify, label, or screen for any clinical condition. That boundary is a deliberate design and product decision, not a missing feature.

The distinction matters. Diagnostic voice-biomarker products have faced hard reckonings, including notable shutdowns in 2026, while emotion-recognition systems sit under tightening regulation such as the EU AI Act. Building for emotional understanding and supportive response, rather than diagnosis, keeps the system useful, defensible, and honest about what voice can and cannot reliably tell us.

Put an emotion-aware voice in front of the people you serve.

Pilots run in clinical, intake, and support settings. Bring a workflow; we will show you where the loop fits.