How We Rendered 3 Years of Video Using JavaScript

A big part of our product at Lumen5 is our video rendering engine (we're a video creation tool, after all). A couple of years ago, we started experimenting with WebGL as a potential new way to render videos. It's totally transformed the way we think about JavaScript development, and opened our eyes to a completely different web paradigm.

When we first started, it wasn't clear that WebGL was the way to go for us. It took a lot of research into the different alternatives before we broke ground on our existing render engine. I want to share why we thought WebGL was the best alternative for us in our situation.

The Problem and Requirements

Our users come to our website and interact with our video creator application to set up a series of scenes (which are combinations of text, images, videos, animations, audio, etc.). When they are happy with their set of scenes, they click "render". We needed something that would take that data and produce a final mp4 file.

The basic requirement: Take in some data (in the form of JSON + media assets) and produce an mp4 output.
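
To make the input side concrete, here is a minimal sketch of what such a JSON payload might look like once parsed. Every field name (`durationSeconds`, `text`, `mediaUrl`, `fps`) is an assumption for illustration, not our actual schema.

```typescript
// Hypothetical shape of the render input. All field names are assumptions
// made for this sketch, not a real schema.
interface Scene {
  durationSeconds: number; // how long the scene stays on screen
  text?: string;           // optional text overlay
  mediaUrl?: string;       // optional image or video asset
}

interface RenderJob {
  fps: number;             // output frame rate
  scenes: Scene[];
}

// Total number of frames the renderer has to produce for a job.
function totalFrames(job: RenderJob): number {
  const seconds = job.scenes.reduce((sum, scene) => sum + scene.durationSeconds, 0);
  return Math.round(seconds * job.fps);
}

// The data arrives as JSON, so it can be parsed straight into this shape.
const job: RenderJob = JSON.parse(
  '{"fps": 30, "scenes": [{"durationSeconds": 2.5, "text": "Hello"}, {"durationSeconds": 4}]}'
);
```

With this example payload, `totalFrames(job)` comes out to 195 frames (6.5 seconds at 30 frames per second).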

Other requirements:

The system should be scalable to hundreds of thousands of users creating videos
The system shouldn't cost too much to run (we understand that processing video is expensive, but let's try to minimize where possible)
The system should render faster than real-time: one minute of video should take less than one minute to render
The system should be able to show a user a preview of their video in real-time while they are creating it
The system should allow us to have complete control over the rendered output (we should be able to control each pixel!)
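
The faster-than-real-time requirement translates into a per-frame time budget. The sketch below is just that arithmetic; the `workers` parameter is an assumption that frames can be split evenly across parallel renderers, which may not hold for every pipeline.

```typescript
// Average time available per frame if rendering must keep pace with playback.
// `workers` assumes frames are divided evenly across parallel renderers.
function frameBudgetMs(fps: number, workers = 1): number {
  return (1000 / fps) * workers;
}

// True when a render finished in less time than the video takes to play.
function isFasterThanRealTime(videoSeconds: number, renderSeconds: number): boolean {
  return renderSeconds < videoSeconds;
}
```

For a 25 fps video, a single renderer has 40 ms per frame on average, and four parallel renderers would each have 160 ms.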

Given these requirements, we set about analyzing four alternatives.

Option 1: A Proprietary Tool, Repurposed as a Render Farm

The idea: There are plenty of video creation tools to help designers create videos and then render those videos. An easy example, which I focus my analysis on, is Adobe After Effects. We could piggy-back on After Effects, using it to be the backend rendering engine for our users.

The main problem: I experimented with this idea quite a bit, and it turns out that there isn't one main problem, there is a multitude of them! Here are a few:

There aren't great APIs to interact with, so you don't have much control over the video, the rendering settings, etc. For each feature you create, you've got to hack around the pre-existing system in ExtendScript (Adobe's older dialect of JavaScript).
Performance is poor. After Effects is designed for a single user to spend a long time configuring their video, then clicking the "render" button, and getting a coffee for the hour (or more) that the render takes. So getting this system to support hundreds of thousands of users is super challenging.
The developer experience is poor. Adobe doesn't provide a testing framework, the open source community around ExtendScript is small compared to JavaScript's, and the documentation is sparse.

As you can see, this turned out to be one of the worst alternatives that we looked at.

Option 2: An Open-Source Tool, Repurposed as a Render Farm

The idea: Rather than using a proprietary video tool, use an open source one. The example that I spent the most time looking at was Blender. Blender has a Python API and our backend is already written in Python, so this made good sense as a solution. Also, since Blender is open source, it'd be possible to fork it and create a totally custom version suited to just our needs, if we needed ultimate control.

The main problem: The main reason we didn't end up going with this approach was our preview requirement. We want users to be able to preview their video in real-time as they are editing it. Since our users are using a web browser, it'd be really complicated for us to generate a dynamic preview on our server in Blender and then stream that preview down to each user's browser at scale. Maybe this is possible, but when we thought of this issue, we decided to explore some browser-based solutions.

Option 3: Use HTML + CSS + JavaScript

The idea: If our users are interacting with the tool in their browser, let's generate the contents of our video with the things the browser is best known for: HTML + CSS + JavaScript. Here's an example of a simple scene that we could create.

This example shows how we can animate various properties of DOM elements using CSS or JavaScript.
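
As a rough illustration of the JavaScript side of this idea, the sketch below computes the style of a title for any given frame. The fade-and-slide effect, its one-second duration, and the function names are all assumptions for illustration.

```typescript
// Linear interpolation between two values, with progress clamped to [0, 1].
function lerp(from: number, to: number, progress: number): number {
  const t = Math.min(1, Math.max(0, progress));
  return from + (to - from) * t;
}

// Hypothetical one-second fade-and-slide-in for a title element.
// Returns CSS values for a given frame number, so the animation state is a
// pure function of the frame rather than of wall-clock time.
function titleStyleAtFrame(frame: number, fps: number): { opacity: string; transform: string } {
  const progress = frame / fps; // the animation lasts exactly one second
  const opacity = lerp(0, 1, progress);
  const offsetPx = lerp(40, 0, progress);
  return {
    opacity: opacity.toFixed(2),
    transform: `translateY(${offsetPx.toFixed(1)}px)`,
  };
}
```

In a browser, the result could be applied with something like `Object.assign(element.style, titleStyleAtFrame(frame, 30))`. Making the style depend only on the frame number is what allows the same code to drive both a live preview and a frame-by-frame capture.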

In order to create the actual mp4 file, we would need to use a browser-automation tool, like Puppeteer. How this would work:

Instantiate a browser tab through Puppeteer and navigate to the page where the video preview is hosted
Tell the video preview page to seek to frame 0
Capture a screenshot using the Puppeteer API
Repeat steps 2 and 3 for every remaining frame in the video
Stitch together the frames into an mp4 using a video utility like ffmpeg
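
The steps above can be sketched roughly as follows. To keep the sketch dependency-free, it declares a small `CapturePage` interface instead of importing Puppeteer; Puppeteer's `Page` has `evaluate` and `screenshot` methods of this general shape. The `seekToFrame` hook on the preview page and the file naming scheme are assumptions.

```typescript
// Minimal subset of a browser-automation page that this sketch relies on.
interface CapturePage {
  evaluate(fn: (frame: number) => void, frame: number): Promise<unknown>;
  screenshot(options: { path: string }): Promise<unknown>;
}

// Zero-padded file name so that ffmpeg's %05d pattern matches.
// Five digits covers up to 99,999 frames.
function framePath(dir: string, frame: number): string {
  return `${dir}/frame-${("00000" + frame).slice(-5)}.png`;
}

// Steps 2 to 4: seek to each frame in turn and capture a screenshot of it.
async function captureFrames(page: CapturePage, frameCount: number, dir: string): Promise<string[]> {
  const paths: string[] = [];
  for (let frame = 0; frame < frameCount; frame++) {
    // seekToFrame is a hypothetical global exposed by the preview page.
    await page.evaluate((f) => (globalThis as any).seekToFrame(f), frame);
    const path = framePath(dir, frame);
    await page.screenshot({ path });
    paths.push(path);
  }
  return paths;
}

// Step 5: arguments for stitching the numbered PNGs into an H.264 mp4.
function ffmpegArgs(dir: string, fps: number, output: string): string[] {
  return [
    "-framerate", String(fps),
    "-i", `${dir}/frame-%05d.png`,
    "-c:v", "libx264",
    "-pix_fmt", "yuv420p",
    output,
  ];
}
```

The argument list would then be handed to the `ffmpeg` binary, for example through Node's `child_process.spawn`. Capturing one screenshot per frame is simple but serial, which is part of why performance was a concern for us.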

Interestingly, there are some projects that are using this very approach. You can see the source code in a really cool project called Remotion that is doing exactly this.

The main problem: For us, we realized that we wanted even more control over the specific pixels being rendered than the normal DOM API would give us. For example, the DOM API makes animating an entire tag worth of text pretty simple - our codepen above does exactly that. But when it comes to controlling how the glyphs within the text are rendered and animated, the DOM API's support starts to drop off.

Additionally, we were concerned about the scalability and performance of this solution. Assuming that there are potentially thousands of DOM nodes all with various properties being simultaneously animated in one of our videos, this option began to make less sense.

Option 4: WebGL

The idea: This is a similar idea to Option 3, but instead of creating the video contents out of DOM nodes, we could use a canvas element and WebGL APIs. This would give us complete control over every pixel, and would also give us more flexibility to optimize performance on our own (without relying on the DOM API).
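
To give a feel for what "control over every pixel" means, here is a CPU-side analogue of the model a WebGL fragment shader uses: one small function decides the color of each pixel. This is not WebGL code and not our engine; it runs on the CPU only to keep the sketch self-contained, and the `wipe` effect is a made-up example.

```typescript
// A per-pixel function: given normalised coordinates and a time, return RGB
// components in [0, 1]. A fragment shader plays this role on the GPU.
type Shade = (u: number, v: number, timeSeconds: number) => [number, number, number];

// Fills an RGBA buffer by evaluating `shade` once per pixel.
function renderFrame(width: number, height: number, timeSeconds: number, shade: Shade): Uint8ClampedArray {
  const pixels = new Uint8ClampedArray(width * height * 4);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      // Normalised coordinates of the pixel centre, in [0, 1].
      const u = (x + 0.5) / width;
      const v = (y + 0.5) / height;
      const [r, g, b] = shade(u, v, timeSeconds);
      const i = (y * width + x) * 4;
      pixels[i] = r * 255;
      pixels[i + 1] = g * 255;
      pixels[i + 2] = b * 255;
      pixels[i + 3] = 255; // fully opaque
    }
  }
  return pixels;
}

// Hypothetical effect: a white wipe that sweeps left to right over one second.
const wipe: Shade = (u, _v, t) => (u < t ? [1, 1, 1] : [0, 0, 0]);
```

Calling `renderFrame(4, 1, 0.5, wipe)` yields a row of four pixels where the left two are white and the right two are black. In a browser, a buffer like this could be drawn to a 2D canvas via `ImageData`; with WebGL the per-pixel function would instead be written in GLSL and run on the GPU.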

Of course, it's not all roses. There are still lots and lots of challenges with WebGL. A couple that come to mind:

WebGL is complex. In fact, the first line in the MDN docs on WebGL best practices literally starts with: "WebGL is a complicated API, and it's often not obvious what the recommended ways to use it are." So getting to know this new paradigm takes a lot of work and learning.
More control over pixels means that there are more ways to mess things up! Preventing unexpected performance regressions and preventing unexpected visual regressions are two very challenging areas.
Getting consistent results on different platforms is more challenging than you'd first expect. As we are integrating more directly with our users' systems, it's up to us to maintain good compatibility, because the browser does much less to smooth over the differences between us and the user's hardware.
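
One common way to catch visual regressions and platform differences is to compare a freshly rendered frame against a stored reference frame. The sketch below is a minimal version of that idea, assuming both frames are RGBA byte buffers of the same size; the function name and the per-channel tolerance are illustrative choices.

```typescript
// Counts pixels whose value differs by more than `tolerance` on any of the
// four RGBA channels. Both buffers must describe frames of the same size.
function countDifferentPixels(a: Uint8ClampedArray, b: Uint8ClampedArray, tolerance: number): number {
  if (a.length !== b.length) {
    throw new Error("frames must have the same dimensions");
  }
  let different = 0;
  for (let i = 0; i < a.length; i += 4) {
    let pixelDiffers = false;
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[i + c] - b[i + c]) > tolerance) {
        pixelDiffers = true;
      }
    }
    if (pixelDiffers) {
      different++;
    }
  }
  return different;
}
```

A small non-zero tolerance helps absorb minor rounding differences between GPUs and drivers, while a large count of differing pixels points to a real regression.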

Conclusion

For us, WebGL presented a solid solution to our rendering problem. We've now worked with it for multiple years, and learned a lot. It's exciting to be on the cutting edge of web-based graphics, marrying the two disciplines of web engineering and computer graphics development.

If you're interested in this space, add me on LinkedIn. I'd love to chat!

Footnote: There are other legal and licensing considerations that also went into this decision, but in this context it's more interesting to discuss the technical tradeoffs, so I've left those out ;)