Democratization of Text-to-video Models Will Launch Us Into a Hyperreal Cyberspace. Are We Ready?

Written by privs | Published 2024/02/28
Tech Story Tags: generative-ai | society | ai-generated-art | content | future-of-the-internet | ai-generated-video | openai-sora | ai-ethics

TL;DR: Democratization of access to GenAI will lead to an explosion of AI-generated content on the internet. Everything we see on content platforms will show (probably superior) visuals of people, places, and events that never existed or happened. Will this disorient our idea of reality, or trigger an upward comparison that renders our world dry and lacking? Whatever the case, when AI generates content for us, we will be compelled to exercise our higher cognitive abilities to create more sophisticated content, just faster and better.

I heard about AI-generated art for the first time in a Wall Street Journal podcast in 2021. In it, the host played AI-generated pieces of classical music. These pieces emulated the styles of classical composers, and the precision with which they did so was rather impressive. A listener uninitiated in classical music would probably have a hard time distinguishing them from the actual works of those composers.

This was also the first time I took generative AI (GenAI) seriously. Listening to those pieces made me wonder – what if AI gets so good at making music that those renowned composers eventually get buried under piles of AI-generated classical music? I have always believed in judging a work of art for what it is, rather than finding its aesthetic validity in a musician’s legacy or in how critics received a piece. If AI-generated music is going to get better, I will most likely say, “Sorry, Chopin.”

Last week, a colleague showed me OpenAI Sora, a text-to-video model that generates remarkably detailed, realistic videos from text prompts. The first video showcased on the website featured a woman (with five fingers, not four) walking through the streets of Tokyo. The texture of her skin, the blemishes and spots on her face, the wrinkles, and her hair – everything about this woman is more realistic than (or at least as realistic as) the subjects of videos that were once ‘really shot’ by professional videographers.

Are we entering the era of the hyperreal?

In his book Simulacra and Simulation, Jean Baudrillard defined the hyperreal as the blurring of the lines that separate reality from representation. This blurring is caused by the creation of representations (signs) that do not refer to anything that exists in reality. Hyperreality results in our inability to distinguish what is simulated from what is real.

In his text, Baudrillard was alluding to the process by which value gets created in consumerist paradigms. But with GenAI, his idea is being stretched to its literal limits. On the OpenAI Sora page are over 30 video clips, each containing chunks of time that never happened, all in great detail, ready for us to witness with our eyes. What do we see when we see ‘historical footage of California during the gold rush’ generated by Sora?

In truth, there is nothing historical about this footage, nor is it footage in the true sense. The historical effect is rendered by employing the visual aesthetics that elicit a sense of historicity in films and documentaries – warm tones, a vintage overlay, grainy effects, and a lower frame rate. Yet the prompt results in very believable “footage.” I am no historian, so I cannot say whether the clip gets historical details right in how it renders the buildings and the attire of the people on horses. But the geographical features look rather compelling.

But this is not the crux of the issue here. Studios have achieved such visual effects with CGI for decades. Those efforts, however, were applied to a specific end and within particular contexts – creating a realistic game, a movie sequence, or making a documentary more tangible, if you will. The content was vetted by experts who could verify its correctness and authenticity. Now that text-to-video models will democratize access to such capabilities, the consequences will be very different. Content will feel the effects of this democratization of GenAI long before GenAI starts weaving symphonies that draw tears from our eyes or writing Nobel-winning novels.

A new distinction between content and art

Content and art have coexisted ever since the advent of media – writing and painting, followed by print and photography, then audio and video. Content was usually short-lived and created for a specific function, such as entertaining, delivering news, capturing memories, recording events, or even selling things. Art served a higher purpose in the human being’s socio-cultural experience, and artworks survived well beyond their time. However, both content and art had to be created before GenAI entered the picture.

Until now, art – or at least, what is deemed art in ‘high culture’ – has remained largely untouched by GenAI capabilities. While GenAI may be able to emulate Mozart’s musical style, the aesthetic and cultural value of that style was rooted in the time and context in which Mozart was creating music. The same goes for literary works, which in their time shifted the ways of thinking about the world, human beings, society, and the small and big questions underpinning our existence.

The story of content is somewhat different. Until some point in the 2000s, content creation required specialized apparatus – sophisticated video cameras or DSLRs and teams of people were needed to create audio-visual content, for example. But with social media and the proliferation of smartphones, content creation took a new turn. Everyone was now a creator, snapping pictures (or videos) of their dogs and cats doing funny stuff, creating video memes, or making comic faces with Instagram filters. Compilations of these snippets of human existence made their way to YouTube in droves, and content flooded depthless Instagram, Facebook, and TikTok feeds. Yet little of this content gave us reason to question whether its subjects were real.

With the democratization of GenAI, we will find ourselves asking this question more often. In fact, while scrolling through the comments section of the Apple AirPods Pro 2 launch live stream in 2023, I found a user wondering whether the woman on screen, who was presenting the features of the new AirPods, was a real person. The industry is attempting to solve this problem with standards like C2PA (from the Coalition for Content Provenance and Authenticity), which embed provenance details – such as the creator and the mode of creation – within media files. This will make it possible to differentiate AI-generated content from ‘real’ content.
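To make the idea concrete, here is a minimal sketch of how a platform could use such provenance metadata to label AI-generated media. This is an illustration, not the real C2PA API – the manifest below is a simplified, hypothetical stand-in, though the digitalSourceType value trainedAlgorithmicMedia is borrowed from the IPTC vocabulary that the actual specification uses:

```python
# Illustrative, simplified stand-in for a C2PA-style provenance manifest.
# Real manifests are cryptographically signed and embedded in the media file;
# field names here are a hypothetical approximation of that structure.
manifest = {
    "claim_generator": "TextToVideoModel/1.0",  # hypothetical generator name
    "assertions": [
        {
            "label": "c2pa.actions",
            "data": {
                "actions": [
                    {"action": "c2pa.created",
                     "digitalSourceType": "trainedAlgorithmicMedia"}
                ]
            },
        }
    ],
}

def is_ai_generated(manifest: dict) -> bool:
    """Return True if the manifest declares the media was produced by a model."""
    for assertion in manifest.get("assertions", []):
        for action in assertion.get("data", {}).get("actions", []):
            if action.get("digitalSourceType") == "trainedAlgorithmicMedia":
                return True
    return False

print(is_ai_generated(manifest))  # True -> the platform can label the clip
```

A platform that verifies the manifest’s signature before trusting it could then surface an ‘AI-generated’ badge on the clip – which is essentially what provenance standards aim to enable.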

Nonetheless, AI-generated content will coexist with the content that is created by people. This will add another level of distinction between content and art – which is that art must be created, but content can be generated.

The ability to generate content will undoubtedly have far-reaching economic consequences and leave a profound impact on many professions. But it will also alter our relationship with content and how it informs our worldview.

The desert of the real?

What if generated content is more compelling, more convincing, more entertaining, and more appealing to the viewer? Given how most societies today are inclined to promote whatever creates the most value, generated content would likely overtake content created by people in such a scenario – which means that content platforms will be saturated with AI-generated content showing places that never existed, people who never walked the planet, events that never occurred, or cute dog stuff that no real dog ever did.

This is quite likely to happen – AI thought leader Nina Schick has predicted that as much as 90% of online content could be AI-generated by 2025. But the proliferation of AI-generated content could go two ways.

According to neuroscientists, knowing that an artwork is authentic activates reward pathways in viewers’ brains. Forged or fake art, on the other hand, does not produce the same effect. If this result extrapolates to AI-generated content, one would conclude that AI-generated content will not be as fulfilling as that created by real people recording real things – provided that content platforms enforce rules that enable viewers to identify AI-generated content.

Will AI-generated dog videos be as effective at reducing our stress levels as footage of real dogs doing real things? If not, then the adoption of text-to-video models will be limited to certain areas, like producing clips for purchasable stock video collections or generating elements of videos, such as backgrounds and animated overlays.

In this scenario, we will see the return of content that is created rather than generated – and social media platforms will be the first sites where this shift unfolds.

But there is another factor to take into account here. Research on Instagram usage patterns has shown that upward comparison (comparing oneself to people one perceives as superior) through content consumption has detrimental effects on users’ mental health. If AI-generated content breaks through the barriers of creative possibility, it will present a version of the world that is superior to its real counterpart. If that happens, will AI-generated content immerse us in a hyperreal world, recasting our existing one as a dull and dry reality?

Probably not – at least not right away. Current text-to-video models still have limitations. In one Sora-generated clip, wolf pups emerge out of nothing and bodies magically pass through other bodies; in another, a plastic chair behaves like a sheet of paper, changing shape throughout the clip. OpenAI claims that giving the model foresight into multiple frames at a time enables it to retain the visual identity of a subject even when the subject goes out of view for a few moments. But this does not make the model infallible, as is evident in the videos on OpenAI’s Sora page. Moreover, the generated clips are still capped at 60 seconds, and it will probably be a while before we see full-length feature films and documentaries generated by such models.

However, these limitations will likely be eliminated in upcoming generations of multimodal models – and soon enough, considering that the technology has been in development for a little over half a decade. When that time comes, the experience of consuming content or simply navigating through cyberspace will take a new turn. For the better or worse? Only time will tell.

Looking towards a brighter future

Generation and creation remain very distinct ideas, even in the era of GenAI. The distinction is rooted in the etymology of these words – create comes from the Latin creare, to bring forth: to make something where nothing was before. Generate, on the other hand, comes from the Latin generō (to produce), which in turn comes from genus (a type or a family). In both mathematics and linguistics, generation is a process of applying a set of rules to an input to produce something. This is precisely what Generative Pre-trained Transformers (GPTs) do.
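To see generation in this rule-applying sense, consider a toy generative grammar – a hypothetical, minimal example, not how a GPT is implemented, but the same principle of rules applied to a symbol until an output emerges:

```python
import random

# A toy context-free grammar: each non-terminal maps to a list of possible
# expansions (rules). Generation = applying rules to an input symbol until
# only terminal words remain.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N":  [["dog"], ["camera"], ["composer"]],
    "V":  [["films"], ["fools"], ["emulates"]],
}

def generate(symbol: str) -> list[str]:
    """Recursively expand a symbol by picking one of its rules at random."""
    if symbol not in GRAMMAR:  # terminal word: nothing left to expand
        return [symbol]
    rule = random.choice(GRAMMAR[symbol])
    return [word for part in rule for word in generate(part)]

print(" ".join(generate("S")))  # e.g. "the camera fools the composer"
```

Every sentence this produces is ‘new’, yet none escapes the genus defined by the rules – which is precisely the sense in which generation differs from creation.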

As far-fetched as the meaning of creation may seem in the era of generated content, GenAI is compelling us to exercise our higher cognitive abilities as it abstracts away the heavy lifting of content production. Creation, at its heart, remains a process of making something out of nothing. It is this idea of creation that has ushered in era-defining epochs in the course of our cultural, social, and technological history.

With GenAI solutions taking over the less cognitive aspects of content creation, each of us will be empowered to create more sophisticated things, just better and faster. Where this paradigm of GenAI-augmented creation takes us, though, will be determined collectively by content-powered platforms, regulators, content provenance standards and how widely they are endorsed, and our inevitable neurological responses to new forms of content.


Written by privs | Runner, reader, and writer split between many worlds – tech, culture, internet, fitness, and business, to name a few.
Published by HackerNoon on 2024/02/28