electronic eye

ChatGPT update enables its AI to “see, hear, and speak,” according to OpenAI

Image recognition and voice features aim to make the AI bot's interface more intuitive.

Benj Edwards – Sep 25, 2023 6:38 PM

An illustration of a cybernetic eyeball. Credit: Getty Images

On Monday, OpenAI announced a significant update to ChatGPT that enables its GPT-3.5 and GPT-4 AI models to analyze images and react to them as part of a text conversation. Also, the ChatGPT mobile app will add speech synthesis options that, when paired with its existing speech recognition features, will enable fully verbal conversations with the AI assistant, OpenAI says.

OpenAI is planning to roll out these features in ChatGPT to Plus and Enterprise subscribers "over the next two weeks." It also notes that speech synthesis is coming to iOS and Android only, and image recognition will be available on both the web interface and the mobile apps.

OpenAI says the new image recognition feature in ChatGPT lets users upload one or more images for conversation, using either the GPT-3.5 or GPT-4 models. In its promotional blog post, the company claims the feature can be used for a variety of everyday applications: from figuring out what's for dinner by taking pictures of the fridge and pantry, to troubleshooting why your grill won’t start. It also says that users can use their device's touch screen to circle parts of the image that they would like ChatGPT to concentrate on.

On its site, OpenAI provides a promotional video that illustrates a hypothetical exchange with ChatGPT where a user asks how to raise a bicycle seat, providing photos as well as an instruction manual and an image of the user's toolbox. ChatGPT reacts and advises the user how to complete the process. We have not tested this feature ourselves, so its real-world effectiveness is unknown.


So how does it work? OpenAI has not released technical details of how GPT-4 or its multimodal version, GPT-4V, operates under the hood, but based on known AI research from others (including OpenAI partner Microsoft), multimodal AI models typically transform text and images into a shared encoding space, which enables them to process various types of data through the same neural network. OpenAI may use CLIP to bridge the gap between visual and text data in a way that aligns image and text representations in the same latent space, a kind of vectorized web of data relationships. That technique could allow ChatGPT to make contextual deductions across text and images, though this is speculative on our part.
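To make the idea of a shared latent space more concrete, here is a toy Python sketch. It is not OpenAI's implementation and does not call CLIP or any real model: the file names, captions, and three-dimensional vectors are all made up for illustration, standing in for what trained image and text encoders might output. The only point is that once images and text live in the same vector space, matching them can reduce to a similarity comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings. In a real system, an image encoder and a text
# encoder would produce these; here they are hand-written placeholder values.
IMAGE_EMBEDDINGS = {
    "photo_of_bicycle.jpg": [0.9, 0.1, 0.2],
    "photo_of_grill.jpg": [0.1, 0.8, 0.3],
}
TEXT_EMBEDDINGS = {
    "a bicycle seat": [0.8, 0.2, 0.1],
    "a gas grill": [0.2, 0.9, 0.2],
    "a refrigerator": [0.1, 0.2, 0.9],
}


def best_caption(image_name):
    """Return the caption whose vector lies closest to the image's vector."""
    image_vec = IMAGE_EMBEDDINGS[image_name]
    return max(
        TEXT_EMBEDDINGS,
        key=lambda caption: cosine_similarity(image_vec, TEXT_EMBEDDINGS[caption]),
    )


if __name__ == "__main__":
    for name in IMAGE_EMBEDDINGS:
        print(name, "->", best_caption(name))
```

In this sketch, each image is paired with whichever caption points in the most similar direction. A production model would learn its embeddings from large numbers of image and text pairs and would use hundreds of dimensions rather than three.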

Meanwhile in audio land, ChatGPT's new voice synthesis feature reportedly allows for back-and-forth spoken conversation with ChatGPT, driven by what OpenAI calls a "new text-to-speech model," although text-to-speech technology has been widely available for a long time. Once the feature rolls out, the company says that users can engage the feature by opting in to voice conversations in the app's settings and then selecting from five different synthetic voices named "Juniper," "Sky," "Cove," "Ember," and "Breeze." OpenAI says these voices have been crafted in collaboration with professional voice actors.

OpenAI's Whisper, an open source speech recognition system we covered in September of last year, will continue to handle the transcription of user speech input. Whisper has been integrated with the ChatGPT iOS app since it launched in May. OpenAI released the similarly capable ChatGPT Android app in July.

“ChatGPT is not always accurate”

When OpenAI announced GPT-4 in March, it showcased the AI model's "multimodal" capabilities that purportedly allow it to process both text and image input, but the image feature remained largely off-limits to the public during a testing process. Instead, OpenAI partnered with Be My Eyes to create an app that could interpret photos of scenes for blind persons. In July, we reported that privacy issues were holding back the release of OpenAI's multimodal features, which have stayed largely unavailable until now. Meanwhile, Microsoft less cautiously added image recognition capability to Bing Chat, an AI assistant based on GPT-4, in July.

In its recent ChatGPT update announcement, OpenAI points out several limitations to the expanded features of ChatGPT, acknowledging issues that range from the potential for visual confabulations (i.e., misidentifying something) to the vision model's less-than-perfect recognition of non-English languages. The company says it has conducted risk assessments "in domains such as extremism and scientific proficiency" and sought input from alpha testers but still advises caution on its use, especially in high-stakes or specialized contexts like scientific research.

Informed by the privacy issues encountered while working on the aforementioned Be My Eyes app, OpenAI notes that it has taken "technical measures to significantly limit ChatGPT’s ability to analyze and make direct statements about people since ChatGPT is not always accurate and these systems should respect individuals’ privacy."

Despite their drawbacks, in marketing materials, OpenAI is billing these new features as giving ChatGPT the ability to "see, hear, and speak." Not everyone is happy about the anthropomorphism and potential hype language involved. On X, Hugging Face AI researcher Dr. Sasha Luccioni posted, "The always and forever PSA: stop treating AI models like humans. No, ChatGPT cannot 'see, hear and speak.' It can be integrated with sensors that will feed it information in different modalities."

While ChatGPT and its associated AI models are clearly not human—and hype is a very real thing in marketing—if the updates perform as shown, they potentially represent a significant expansion in capabilities for OpenAI's computer assistant. But since we have not evaluated them yet, that remains to be seen.

We'll keep you updated with new developments as the new features roll out widely in the coming weeks. In the meantime, OpenAI says the delay is for a good reason: "We believe in making our tools available gradually," the company writes, "which allows us to make improvements and refine risk mitigations over time while also preparing everyone for more powerful systems in the future."


Benj Edwards Senior AI Reporter

Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
