Immersive VR Conversations with AI Avatars: Integrating ChatGPT, Google STT, and AWS Polly

Written by hackerclglihccs00002e6ogngouljy | Published 2023/04/19
Tech Story Tags: artificial-intelligence | chatgpt | virtual-reality | speech-to-text-recognition | ai | ai-trends | technology-trends | hackernoon-top-story

TL;DR: This tech demo combines ChatGPT, Google Speech-to-Text (STT), and Amazon Web Services (AWS) Polly in a VR experience. The result is a truly immersive and interactive conversation with an AI-powered Ready Player Me avatar, driven by ChatGPT's responses and enriched with voice input and output capabilities.

Virtual Reality (VR) has opened up new frontiers in how we interact with technology. I recently had the opportunity to push those boundaries even further with a tech demo that integrates ChatGPT, Google Speech-to-Text (STT), and Amazon Web Services (AWS) Polly in a VR experience.

The result?

A truly immersive and interactive conversation with an AI-powered Ready Player Me avatar, driven by ChatGPT's responses and enriched with voice input and output capabilities.

The concept behind this tech demo was to create a virtual room where users can have realistic conversations with an AI avatar, powered by ChatGPT.

To take the experience to the next level, I integrated Google STT for voice input, which transcribes the user's speech into text. This text is sent to a microservice for processing and forwarded to ChatGPT to generate a relevant response. Once the response is generated, AWS Polly handles the text-to-speech (TTS) conversion, and the output is sent back to the avatar for voice processing, resulting in a mostly seamless and dynamic conversation.
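The actual microservice is linked at the end of this article; below is only a minimal sketch of the same flow, assuming an Express server, the official OpenAI and AWS SDK v3 clients, and a /chat endpoint name of my own invention rather than the real sigmund API:

  import express from "express";
  import OpenAI from "openai";
  import { PollyClient, SynthesizeSpeechCommand } from "@aws-sdk/client-polly";

  const app = express();
  app.use(express.json());
  const openai = new OpenAI();       // reads OPENAI_API_KEY from the environment
  const polly = new PollyClient({}); // reads AWS credentials/region from the environment

  // The client sends the Google STT transcript; we return text plus Polly audio.
  app.post("/chat", async (req, res) => {
    const transcript: string = req.body.transcript;

    // 1. Forward the transcribed speech to ChatGPT for a response.
    const completion = await openai.chat.completions.create({
      model: "gpt-3.5-turbo",
      messages: [
        { role: "system", content: "You are a helpful in-world character." },
        { role: "user", content: transcript },
      ],
    });
    const reply = completion.choices[0].message.content ?? "";

    // 2. Convert the text response to speech with AWS Polly.
    const speech = await polly.send(
      new SynthesizeSpeechCommand({
        Text: reply,
        OutputFormat: "mp3",
        VoiceId: "Joanna", // placeholder voice
      })
    );

    // 3. Ship the audio (and the text, useful for captions) back to the client.
    const audio = Buffer.from(await speech.AudioStream!.transformToByteArray());
    res.json({ reply, audioBase64: audio.toString("base64") });
  });

  app.listen(8080);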

One of the standout features of this tech demo is the integration of Ready Player Me avatars with lip sync enabled. As audio plays, the avatar's mouth moves in sync with its speech, creating a highly realistic and interactive conversation experience. These avatars serve as the visual representation of the AI, adding a layer of immersion and personalization to the conversation.
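In the demo itself, lip sync is handled by the Ready Player Me avatar setup inside Unity. To illustrate the basic idea outside of Unity, here is a deliberately simplified, amplitude-driven sketch using the Web Audio API; proper viseme-based lip sync is far more sophisticated, and setMouthOpen is a hypothetical hook into an avatar rig:

  // Maps playback loudness to a 0..1 "mouth open" value every frame.
  function driveMouth(audioEl: HTMLAudioElement, setMouthOpen: (v: number) => void) {
    const ctx = new AudioContext();
    const source = ctx.createMediaElementSource(audioEl);
    const analyser = ctx.createAnalyser();
    analyser.fftSize = 256;
    source.connect(analyser);
    analyser.connect(ctx.destination);

    const samples = new Uint8Array(analyser.fftSize);
    const tick = () => {
      analyser.getByteTimeDomainData(samples);
      let sum = 0;
      for (const s of samples) {
        const deviation = (s - 128) / 128; // center around zero
        sum += deviation * deviation;
      }
      const rms = Math.sqrt(sum / samples.length);
      setMouthOpen(Math.min(1, rms * 4)); // scale factor tuned by eye
      requestAnimationFrame(tick);
    };
    tick();
  }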

To make the conversations engaging, I created three pre-filled prompt scenarios for ChatGPT.

In the first scenario, the AI plays the role of a financial representative, providing advice on managing finances and investments.

https://youtu.be/CtmqZEEH-mY

The second scenario involves the AI acting as a psychiatrist, providing virtual therapy and counseling.

https://youtu.be/QG4efRFH82E

Lastly, in the third scenario, the AI takes on the persona of a fantasy merchant, selling virtual gear and items.

https://youtu.be/r8DcUEcx_kQ

These scenarios provide a glimpse into the potential use cases of this technology in various domains, such as finance, mental health, and entertainment.

Although not talked about enough, prompt engineering is a talent in its own right. As you can see in the code, setting up a contextual scene and ensuring the avatar doesn't lose character can be complicated: essentially, we need the model to stay on script while remaining realistic. In the full videos above, you'll notice the fantasy merchant occasionally breaks character and displays a repetitive, almost nervous tic of saying "well, well, well" while vocalizing its emotions.
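The demo's actual prompts aren't reproduced here, but a system prompt in the spirit of the fantasy-merchant scenario might look like the following; the character name and wording are invented for illustration:

  // Hypothetical scene-setting prompt; the real prompts live in the demo code.
  const messages = [
    {
      role: "system",
      content:
        "You are Bram, a merchant in a medieval fantasy market. You sell potions, " +
        "blades, and enchanted trinkets. Stay in character at all times: never mention " +
        "being an AI or a language model. Keep replies under three sentences, haggle " +
        "playfully, and describe your wares with sensory detail.",
    },
    { role: "user", content: "What do you have for fighting skeletons?" },
  ];

Explicitly forbidding the model from mentioning that it is an AI, and constraining reply length, goes a long way toward keeping the avatar in character, though as the videos show, it is not foolproof.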

Creating Believable Environments for Immersive VR Conversations

It's important to note that this tech demo primarily used off-the-shelf animations and models for the art direction. For a full-fledged application, however, investing in realistic animations can enhance the believability and naturalness of the AI interactions: talking animations driven by sentiment analysis to select positive or negative tones, and filler animations to cover processing time. This will further elevate the immersive experience and make it more akin to human-like conversation.
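As a rough illustration of what sentiment-gated animation selection could look like, here is a naive keyword-based sketch; a production build would use a real sentiment model, and the animation names are hypothetical:

  // Pick a talking animation tone from the AI's reply text.
  const POSITIVE = ["great", "glad", "wonderful", "happy", "excellent"];
  const NEGATIVE = ["sorry", "unfortunately", "risk", "loss", "afraid"];

  function pickTalkingAnimation(reply: string): "Talk_Positive" | "Talk_Negative" | "Talk_Neutral" {
    const words = reply.toLowerCase().split(/\W+/);
    const score = words.reduce(
      (s, w) => s + (POSITIVE.includes(w) ? 1 : 0) - (NEGATIVE.includes(w) ? 1 : 0),
      0
    );
    if (score > 0) return "Talk_Positive";
    if (score < 0) return "Talk_Negative";
    return "Talk_Neutral";
  }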

One of the challenges in creating a truly immersive VR conversation experience is the limitation of our senses. In virtual environments, we typically rely on sight and sound to perceive and interact with the world. Because only these two senses are engaged, you become hyper-aware when something in a scene seems off. To make the virtual world feel more real and distract from the artificial nature of the environment, it's crucial to create believable surroundings that mimic real-world environments.

Visuals play a crucial role in creating a sense of presence in VR. Realistic 3D models help, but well-crafted textures, lighting, and animation can make an environment look and feel real even with stylized graphics. For example, if the AI avatar is placed in a virtual office, accurate office furniture, decorations, and lighting create a familiar environment that users can relate to, making the conversation feel more authentic.

Sound is another key element that adds to the immersion in VR conversations. Spatial audio, where the sound changes direction and intensity based on the user's position and head movements, can greatly enhance the sense of presence.

For instance, if the user hears the AI avatar's voice coming from the direction where the avatar is located, it adds to the realism of the conversation. Even more important than the avatar's voice, however, is the background noise of day-to-day life: an assistant rustling papers, people shuffling past outside, phones ringing, and so on. These white-noise sounds help mask the delay while the AI computes its response, distracting the user and keeping them immersed.
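The demo relied on Unity's audio engine, but the same two ideas, a spatialized voice plus a quiet ambience bed, can be sketched with the Web Audio API; the asset name and positions below are placeholders:

  // Position the avatar's voice in 3D space and loop quiet office ambience.
  async function setupSceneAudio(ctx: AudioContext, voiceEl: HTMLAudioElement) {
    const panner = new PannerNode(ctx, {
      panningModel: "HRTF",
      positionX: 0.5, positionY: 1.6, positionZ: -1.2, // avatar's head position
    });
    ctx.createMediaElementSource(voiceEl).connect(panner).connect(ctx.destination);

    // Background white noise masks processing delays between replies.
    const resp = await fetch("office-ambience.mp3"); // placeholder asset
    const ambience = ctx.createBufferSource();
    ambience.buffer = await ctx.decodeAudioData(await resp.arrayBuffer());
    ambience.loop = true;
    const gain = new GainNode(ctx, { gain: 0.15 }); // keep it under the voice
    ambience.connect(gain).connect(ctx.destination);
    ambience.start();
  }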

Watching the replays of the video interactions, you'll notice they all seem slightly off. The environment was specifically crafted around debug overlays, and all background white noise was absent. If I were to focus on creating a realistic experience, my focus areas, in order of importance, would be animations, sound design, set design, and prompt engineering. Prompt engineering comes last because when you are the one talking to the AI, it can shock you at times how good it is at predicting what it should say next, especially when paired with a well-timed animation.

Conclusion - Taking on the Future

While this tech demo showcases the immense potential of integrating ChatGPT, Google STT, and AWS Polly in a VR experience, it also raises important ethical considerations. Ensuring that user data is handled securely and responsibly, and that AI models are trained in a fair and unbiased manner, should be a priority in the development and deployment of such technologies. As these interactions become more widely available, creating simulated virtual humans to trick personal information out of willing users may seem like something out of an episode of Black Mirror, but it is quickly entering the realm of possibility.

In conclusion, this tech demo represents a significant step forward in breaking boundaries in VR interactions with AI. The integration of ChatGPT, Google STT, and AWS Polly enables immersive and dynamic conversations, paving the way for exciting possibilities in domains such as education, customer service, and entertainment. With further advancements in animation and AI technologies, we can expect a future where virtual conversations with AI avatars become more natural, engaging, and mainstream. The potential of this technology is vast, and I am thrilled to see how it evolves and transforms our interaction with AI in the virtual world.

Links:

GitHub for Sigmund Microservice: https://github.com/goldsziggy/sigmund

Docker command to run the microservice:

 docker run -it -p 8080:8080 --env-file .env matthewzygowicz/ms-sigmund
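The container reads its credentials from the .env file passed via --env-file. A hypothetical example, assuming the standard environment variables read by the OpenAI, AWS, and Google client libraries (check the repository's README for the authoritative list):

  OPENAI_API_KEY=sk-...
  AWS_ACCESS_KEY_ID=...
  AWS_SECRET_ACCESS_KEY=...
  AWS_REGION=us-east-1
  GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-service-account.json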

If enough interest is gathered, I can rewrite the Unity portion of the code using all open-source assets so that it can be open-sourced as well.

