The Future is Here: ChatGPT Can Now See, Hear, and Speak
Exciting times are ahead in the world of artificial intelligence. OpenAI has just announced that they are rolling out new voice and image capabilities for ChatGPT, taking the conversational AI to the next level. In the coming weeks, Plus and Enterprise users will be able to not only chat with the bot in writing, but also speak to it and show it images for more intuitive and natural conversations.
This is a huge leap forward in making AI assistants more useful, interactive, and lifelike. The new features open up many more possibilities for how ChatGPT can be helpful in our daily lives. Soon we'll be able to have back-and-forth voice conversations with ChatGPT, take pictures of objects and ask it questions, and get recipe help by showing it photos of ingredients. The potential applications are endless.
Powerful New Voice Capability
One of the most exciting developments is the addition of voice capabilities. Users will now be able to opt into voice conversations with ChatGPT on iOS and Android mobile devices. To start a voice chat, simply tap the headphone icon and select one of five different natural-sounding voices.
ChatGPT's new voices are powered by an advanced text-to-speech model developed in collaboration with professional voice actors. From just text and a few seconds of sample speech, the model can generate remarkably human-like audio. To understand your speech, ChatGPT uses Whisper, OpenAI's open-source speech recognition system.
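For the curious, here's a rough sense of what that speech-recognition step looks like with the open-source Whisper package OpenAI has released. This is only an illustrative sketch - the model size and audio filename are assumptions, not details from the announcement, and the ChatGPT app handles all of this behind the scenes for you.

```python
# Illustrative sketch: transcribing spoken audio with OpenAI's open-source Whisper.
# Assumes `pip install openai-whisper` and a local recording named "question.mp3"
# (the model size and filename are placeholder choices, not from the announcement).
import whisper

model = whisper.load_model("base")          # load a small, general-purpose Whisper model
result = model.transcribe("question.mp3")   # run speech recognition on the recording
print(result["text"])                       # the transcribed text a chatbot would respond to
```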
The voice interaction allows for a much more natural conversation flow compared to typing back and forth. You can now speak conversationally with ChatGPT on the go, ask it questions, request information hands-free, and more. It's like having a real human assistant that you can talk to!
Some fun examples of using the new voice feature:
- Ask for a bedtime story while getting ready for bed
- Settle debates during family dinner conversations
- Get tourist recommendations on a trip by speaking to ChatGPT as you walk around a new city
The voice capability also opens up new possibilities for accessibility. Those unable to type can now engage with ChatGPT through speech. Voice chat makes the AI assistant far more interactive.
Realistic and Customizable Voices
OpenAI says that generating the realistic voices was no easy feat. They collaborated with voice actors to create samples and had to develop an advanced model capable of extrapolating from limited data to produce human-like voices.
Users will be able to pick from five different voice options with different tones and cadences. The voices were designed to sound natural - complete with the pauses, emotions, and inflections of human speech. You can really imagine having a conversation with an intelligent assistant.
Being able to customize the voice adds a personal touch. As the technology improves in the future, perhaps users will even be able to generate voices based on their own recording or a celebrity voice they'd like their AI companion to have!
Image Capabilities - Show ChatGPT What You See
In addition to voice, ChatGPT is also gaining the ability to understand and discuss images. Users can now send the bot one or more pictures to have an intelligent conversation about what's in the images.
Some examples of how you could use this:
- Snap a photo of a graph at work and ask ChatGPT to analyze the data
- Show a picture of your pantry and fridge and have it suggest recipe ideas based on what's available
- Send an image of a broken appliance or gadget you need help troubleshooting
- Take a snapshot of a piece of furniture you're putting together and get assembly help
- Discuss photos from your latest vacation to get history facts and context about landmarks
The image capabilities allow you to visually show ChatGPT what you're talking about, making the back-and-forth interaction more intuitive. If you want to draw attention to a certain part of an image, you can even use the drawing tool to circle or point out specific areas.
Behind the scenes, advanced multimodal AI models power ChatGPT's newfound ability to understand and discuss photos. The models apply language skills to make sense of both text and images, whether they're deciphering natural scenery, documents, screenshots, or something else entirely.
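For developers wondering what a comparable image-plus-text exchange looks like outside the ChatGPT app, here's a minimal sketch using OpenAI's Python API client with a vision-capable chat model. The model name and image URL are assumptions for illustration; the consumer app works through its own interface and may use different models under the hood.

```python
# Minimal sketch: asking a vision-capable OpenAI chat model about a photo via the API.
# The model name and image URL are illustrative assumptions; the ChatGPT app itself
# handles image uploads through its own interface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model; use whichever is available to you
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What ingredients do you see here, and what could I cook with them?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photos/my-fridge.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)  # the model's description and suggestions
```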
Gradual Responsible Rollout
As with any powerful new technology, OpenAI is taking care to roll out the voice and vision capabilities slowly and responsibly. Unlike many tech companies that might rush to release new innovations widely, OpenAI aims to deploy gradually. This allows them to continually improve safety, refine risk mitigation, and prepare society for more advanced AI.
The company recognizes that voice and image models come with potential risks if misused. For example, bad actors could leverage realistic voice synthesis for fraud. And vision systems could lead to issues like inappropriate monitoring if deployed irresponsibly into public spaces.
That's why for now, the capabilities are limited to 1-on-1 voice chats and discussing user-provided images. OpenAI can monitor risks closely in this controlled environment before expanding functionality.
Additionally, the company has implemented both technical and policy safety measures. For voice, the AI can only speak in voices created with contracted voice actors, rather than imitating arbitrary people. The image features are restricted from making direct statements about individuals, to respect privacy and avoid potential harms.
OpenAI also emphasizes transparency about model strengths and weaknesses. For example, they advise users that ChatGPT has limited proficiency for non-English languages. And they discourage high-stakes use of its image analysis skills without human verification of accuracy.
Collaborating with Users
A key part of OpenAI's strategy is collaborating directly with users to guide ethical and beneficial innovation.
For example, they worked closely with Be My Eyes, an app that connects blind and low-vision individuals with sighted volunteers. The partnership helped OpenAI understand use cases and limitations for the visual assistance features.
Be My Eyes users provided feedback that it's valuable for vision-impaired individuals to be able to discuss images that happen to contain people - for example, when a person appears on a TV screen while the user is trying to figure out their remote control. This real-world input shaped OpenAI's approach to enabling image conversations while restricting direct statements about individuals.
The company emphasizes that real-world testing and user feedback will continue helping them enhance protections while keeping the tools useful. They encourage users to provide open and honest input to help steer AI in a positive direction.
The Future of AI Assistants
ChatGPT's new voice and vision capabilities represent a major milestone in the evolution of AI. It's an exciting time as we see rapid advancement in the quest to make machines more well-rounded, useful, and accessible.
But it's also a critical juncture requiring diligent stewardship, as these powerful technologies come with risks like any new innovation. OpenAI's gradual, collaborative approach offers a model for how companies can develop AI responsibly rather than rushing advances into the world before proper oversight is in place.
User feedback will be invaluable in shaping how AI can provide the most benefit to society while minimizing harm. We all have a role to play in voicing not just what we want from AI, but also what we don't want - setting ethical guardrails aligned with human values.
OpenAI's latest strides with ChatGPT are just the beginning. If developed thoughtfully, even more advanced AI could one day revolutionize areas like science, education, healthcare and more for the betterment of all. But we must work together to chart a wise path forward.
How will you use ChatGPT's new voice and vision capabilities in your life? What other AI innovations are you excited or concerned to see in the future? Share your thoughts with us!