Current AI voice interactions often resemble awkward exchanges on outdated walkie-talkies, requiring strict alternation between speakers to proceed smoothly.
For instance, a user might say, 'ChatGPT, discuss films with me!' followed by the AI responding, 'Alright, which film interests you?' Although explicit signals like 'over' aren't needed in sessions with tools such as ChatGPT or Gemini, the underlying dynamic mirrors that rigidity.
These systems are even more constrained than a walkie-talkie conversation. While the user is speaking, the AI simply waits, unaware of anything else, including how much time is passing. While it is speaking, it attends only to generating its response and ignores any other input. In effect, this is a text chat with audio layered on top, which is why I rarely use the feature.
A new class of AI systems built for real-time interaction promises to change this by following the rhythms of natural conversation, including the ability to interject while the user is still speaking.
Thinking Machines, a company founded by former OpenAI executive Mira Murati, has built AI models of this kind. Conventional models work turn by turn, handling input and output as separate, sequential steps. These models instead use a multi-threaded design that works in short intervals, processing video and audio continuously while they listen so they can step in at the right moment based on what they see and hear.
Demo videos from Thinking Machines show the system, which is still in preview, responding in real time on video calls. It recognizes objects that participants hold up and keeps track of mentions of animals such as deer or sheep in a continuous stream of speech. In one scenario it also shows patience, holding back rather than interrupting when the user pauses mid-thought to sip coffee.
In a separate showcase, the system interjects on cue, fixing a speaker's mispronunciation of 'acai' and disputing her claim that acai bowls hail from Argentina. While potentially intrusive, this illustrates the AI's capacity to engage actively while processing input, avoiding passive waiting.
The key to Thinking Machines' approach is a pair of AI components. A primary interaction model stays continuously engaged, handling input and output in quick 200-millisecond segments. It is paired with a secondary model that takes on more complex work in the background and hands its results to the faster model once they are ready.
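To make the two-component idea concrete, here is a minimal sketch in Python. It is not Thinking Machines' implementation, whose internals are not public. Everything except the 200-millisecond segment length is a hypothetical stand-in: the `InteractionLoop` and `BackgroundJob` names, the rule that a chunk ending in "?" triggers background work, and the three-segment delay are all assumptions chosen for illustration. The point is only that the fast loop never stops listening while the slow task runs.

```python
from dataclasses import dataclass, field

CHUNK_MS = 200  # segment length mentioned in the article


@dataclass
class BackgroundJob:
    """Hypothetical slow task that finishes after `ticks_left` segments."""
    question: str
    ticks_left: int

    def step(self):
        """Advance by one segment; return the answer once done, else None."""
        self.ticks_left -= 1
        if self.ticks_left <= 0:
            return f"answer to {self.question!r}"
        return None


@dataclass
class InteractionLoop:
    """Hypothetical fast loop: perceives every segment, never blocks."""
    jobs: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def tick(self, t_ms, chunk):
        # 1. Perceive the current segment. Assumption: a trailing "?"
        #    marks something that needs slower, deeper processing.
        if chunk.endswith("?"):
            self.jobs.append(BackgroundJob(chunk, ticks_left=3))
            self.log.append((t_ms, "ack: let me think"))

        # 2. Advance background work and surface any finished results,
        #    without ever pausing the perception step above.
        still_running = []
        for job in self.jobs:
            result = job.step()
            if result is None:
                still_running.append(job)
            else:
                self.log.append((t_ms, result))
        self.jobs = still_running


def run(chunks):
    """Feed one chunk per 200 ms segment and return what the loop said."""
    loop = InteractionLoop()
    for i, chunk in enumerate(chunks):
        loop.tick(i * CHUNK_MS, chunk)
    return loop.log


if __name__ == "__main__":
    for t_ms, message in run(["hello", "what is 2+2?", "um", "so anyway", "bye"]):
        print(f"{t_ms:>4} ms  {message}")
```

In this toy run the loop acknowledges the question at 200 ms, keeps consuming the next segments of speech, and delivers the background result at 600 ms. A turn-based system would instead stop listening until the answer was ready.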
These interactive AI prototypes are still under development, and I have not been able to try them myself. The company acknowledges that the system struggles with long conversations and needs a stable internet connection. The current interaction model is also relatively small, because larger models are too slow for real-time use.
Nevertheless, Thinking Machines' full-duplex framework might fundamentally improve AI audio dialogues, shifting them toward fluid, lifelike exchanges rather than clunky, dated volleys.